Implementing a Basic Full-Text Search Engine in Python
Full-text search is a crucial feature in many applications, allowing users to quickly find relevant information within a large body of text. This challenge asks you to implement a simplified full-text search engine in Python, focusing on indexing and searching a collection of documents. The goal is to build a system that can efficiently locate documents containing specific search terms.
Problem Description
You are tasked with creating a basic full-text search engine. The engine should consist of two primary components: an indexer and a searcher.
Indexer:
- Takes a list of documents (strings) as input.
- Creates an inverted index. An inverted index maps each unique word in the documents to a list of document IDs (indices in the original list) where that word appears.
- The index should be stored in a suitable data structure (e.g., a dictionary).
- Words should be converted to lowercase before indexing.
- Punctuation should be removed from words before indexing.
Searcher:
- Takes the inverted index and a search query (string) as input.
- Converts the search query to lowercase and removes punctuation.
- Splits the query into individual words.
- Looks up each word in the inverted index.
- Returns a set of document IDs that contain all the search query words. This is an AND search.
- If a word in the query is not found in the index, the search should return an empty set.
Expected Behavior:
The search engine should accurately identify documents containing all the specified search terms. The results should be returned as a set of document IDs, ensuring uniqueness and order independence.
Edge Cases to Consider:
- Empty documents: Handle cases where a document is an empty string.
- Empty query: Handle cases where the search query is an empty string.
- Documents with only punctuation: Handle documents that contain only punctuation.
- Case sensitivity: The search should be case-insensitive.
- Punctuation: Punctuation should be removed before indexing and searching.
- Words not in the index: Handle search queries containing words not present in any of the documents.
Examples
Example 1:
Input: documents = ["The quick brown fox.", "The slow blue fox.", "The quick red rabbit."]
Input: query = "quick fox"
Output: {0, 1}
Explanation: Document 0 and Document 1 both contain the words "quick" and "fox". Document 2 contains "quick" but not "fox".
Example 2:
Input: documents = ["This is a test.", "This is another test.", "And yet another."]
Input: query = "is test"
Output: {0, 1}
Explanation: Documents 0 and 1 both contain "is" and "test".
Example 3: (Edge Case)
Input: documents = ["The quick brown fox.", "The slow blue fox.", "The quick red rabbit."]
Input: query = "purple elephant"
Output: set()
Explanation: Neither "purple" nor "elephant" appear in any of the documents.
Example 4: (Edge Case - Empty Document)
Input: documents = ["The quick brown fox.", "", "The quick red rabbit."]
Input: query = "quick"
Output: {0, 2}
Explanation: The empty string document is ignored.
Constraints
- The number of documents will be between 1 and 1000.
- Each document will be a string with a maximum length of 1000 characters.
- The search query will be a string with a maximum length of 100 characters.
- The words in the documents and the query will consist of alphanumeric characters and spaces.
- Performance: The indexing process should complete within 1 second for the given constraints. The search process should complete within 0.5 seconds.
Notes
- Consider using Python's built-in string methods for case conversion and punctuation removal.
- The inverted index is the core data structure for efficient searching. Choose a data structure that allows for fast lookups.
- Focus on clarity and readability in your code.
- This is a simplified full-text search engine. More advanced features like stemming, stop word removal, and ranking are not required for this challenge.
- The order of the document IDs in the output set is not important.