Awesome Information Retrieval

A curated list of awesome information retrieval resources

Last Update: Dec. 4, 2020, 3:14 a.m.

Thank you harpribot & contributors


C.D. Manning, P. Raghavan, H. Schütze. Cambridge UP, 2008. (First book for getting started with Information Retrieval).

Bruce Croft, Don Metzler, and Trevor Strohman. 2009. (A very detailed book for readers who want to know how search engines work).

R. Baeza-Yates, B. Ribeiro-Neto. Addison-Wesley, 1999.

B. Croft, D. Metzler, T. Strohman. Pearson Education, 2009.

W.B. Croft, J. Lafferty. Springer, 2003. (Covers the language modeling side of Information Retrieval and details the probabilistic perspective on the field extensively).

Ed Greengrass, 2000. (Comprehensive survey of conventional Information Retrieval from before the deep learning era).

C.T. Meadow, B.R. Boyce, D.H. Kraft, C.L. Barry. Academic Press, 2007 (library/information science perspective).


Vagelis Hristidis (University of California - Riverside).

Dr. Jilles Vreeken, Prof. Dr. Gerhard Weikum (MPI).

Prof. ChengXiang Zhai (University of Illinois at Urbana-Champaign).


Open-source search engine that can be used to test information retrieval algorithms. Twitter uses this core for its real-time search.

The Lemur Project develops search engines, browser toolbars, text analysis tools, and data resources that support research and development of information retrieval and text mining software.

Another open-source search engine, a competitor to Apache Lucene.

Open-source toolkit for research in language modeling, filtering, and categorization.

Standard IR Collections

This is one of the first collections in the IR domain. The dataset is too small for any statistical significance analysis, but it is suitable for pilot runs.

TREC is the benchmark dataset used by most IR and Web search algorithms. It has several tracks, each of which consists of a dataset for testing a specific task. The tracks, along with their suggested use cases, are:

Explore information seeking behavior in the blogosphere.

Address challenges in building large chemical testbeds for chemical IR.

Investigate techniques to link medical cases to information relevant for patient care.

Investigate search techniques for complex information needs (based on context and user interests).

Explore crowdsourcing methods for performing and evaluating search.

Study search over an organization's data.

Perform entity-related search (find entities and their properties) on Web data.

Make a binary decision on whether to retrieve each new incoming document, given a stable information need.

Study merge performance for results from various search services.

Study retrieval efficiency of genomics data and corresponding documentation.

Obtain High Accuracy Retrieval from Documents by leveraging the searcher's context.

Study user interaction with text retrieval systems.

Study algorithms that improve the efficiency of human knowledge base curators.

Study retrieval systems that achieve high recall for the legal document use case.

Explore unstructured search performance over patient record data.

Examine how well real-time information needs are satisfied on microblogging sites.

Explore ad-hoc retrieval over a large set of queries.

Investigate systems' abilities to locate new (non-redundant) information.

Test systems that scale beyond document retrieval, to retrieve answers to factoid, list and definition type questions.

For deep evaluation of relevance feedback processes.

Study the effectiveness of individual topics.

Develop methods for measuring multiple-query sessions where information needs drift.

Benchmark spam filtering approaches.

Test whether systems can induce the possible tasks users might be trying to accomplish with a query.

Develop systems that allow users to efficiently monitor the information associated with an event over time.

Test the scalability of IR systems to large-scale collections.

Explore information seeking behaviors common in general web search.

This is one of the largest Web collections of documents, obtained from a crawl of government websites by Charlie Clarke and Ian Soboroff using NIST hardware and network, then formatted by Nick Craswell.

This is a collection of a wide variety of datasets, ranging from ad-hoc collections and Chinese IR collections to mobile clickthrough and medical collections. The focus is mostly on East Asian languages and cross-language information retrieval.

This dataset can be used for cross-lingual IR between CJKE (Chinese-Japanese-Korean-English) languages. It is suitable for the following tasks:

The dataset is used for cross-lingual question answering, but the task is more complex than with the CLQA dataset.

It contains a multi-lingual document collection. The test suite includes:

This data set consists of 20,000 newsgroup messages taken from 20 newsgroup topics.

This data set is a comprehensive archive of English newswire text data including headlines, datelines and articles.

Past newswire/paper datasets (DUC 2001 - DUC 2007) are available upon request.

External Curation Links

Technical Talks

Tim Berners-Lee (TED Talk) [Tim Berners-Lee invented the World Wide Web. He leads the World Wide Web Consortium (W3C), overseeing the Web's standards and development].

Steve Tjoa (RackSpace Developers) [This talk shows that IR is not just text and images].

Doug Imbruce (TechCrunch Disrupt) [Doug Imbruce is the Founder of Qwiki, Inc, a technology startup in New York, NY, acquired by Yahoo! in 2013].

Dr. Alma Whitten (Google Brussels Tech Talk).

Philosophical Talks

Andreas Ekström (Swedish Author & Journalist, TED Talk).

Eli Pariser (Author of the Filter Bubble, TED Talk).

Andy Yen (CERN, TED Talk) [This talk covers privacy, which search engines intrude on, and how people can protect it].

Christopher "moot" Poole (TED Talk) [Christopher "moot" Poole is the founder of 4chan, an online imageboard whose anonymous denizens have spawned the web's most bewildering and influential subculture].



Interesting Reads