Rabin-Karp String Search in JavaScript
The Rabin-Karp algorithm is a string searching algorithm that uses hashing to find any one of a set of pattern strings in a text. It's particularly useful when searching for multiple patterns within a single text, as the text's hash values only need to be computed once. Your task is to implement the Rabin-Karp algorithm in JavaScript to efficiently search for a pattern string within a larger text string.
Problem Description
You are required to implement a function rabinKarp(text, pattern) that searches for the first occurrence of a given pattern string within a larger text string using the Rabin-Karp algorithm. The function should return the starting index of the first occurrence of the pattern in the text. If the pattern is not found, the function should return -1.
Key Requirements:
- Hashing: Implement a rolling hash function to efficiently calculate hash values for substrings of the text. A simple polynomial hash function is recommended.
- Rolling Hash: The hash function should be designed to allow for efficient "rolling" updates as you slide the window across the text. This means calculating the hash of the next substring should be relatively inexpensive based on the previous substring's hash.
- Collision Handling: While the Rabin-Karp algorithm relies on hashing, collisions (different strings having the same hash value) are possible. Your implementation must include a character-by-character comparison when a hash match is found to verify that it's a true match and not a collision.
- Prime Number: Use a prime number for the modulus in your hash function to minimize collisions.
- Base: Use a base (radix) greater than the size of the alphabet (e.g., 256 for ASCII).
Expected Behavior:
The function should accurately identify the starting index of the first occurrence of the pattern within the text. It should handle cases where the pattern is not found, and it should correctly handle edge cases such as empty strings or patterns longer than the text.
Edge Cases to Consider:
- Empty text string.
- Empty pattern string.
- Pattern string longer than the text string.
- Pattern string not found in the text.
- Pattern string appearing multiple times in the text (return the first occurrence).
- Text and pattern containing special characters.
Examples
Example 1:
Input: text = "ABABDABACDABABCABAB", pattern = "ABABCABAB"
Output: 10
Explanation: The pattern "ABABCABAB" starts at index 10 in the text "ABABDABACDABABCABAB".
Example 2:
Input: text = "GEEKS FOR GEEKS", pattern = "GEEK"
Output: 0
Explanation: The pattern "GEEK" starts at index 0 in the text "GEEKS FOR GEEKS".
Example 3:
Input: text = "AAAAA", pattern = "BA"
Output: -1
Explanation: The pattern "BA" is not found in the text "AAAAA".
Example 4:
Input: text = "", pattern = "A"
Output: -1
Explanation: The text is empty, so the pattern cannot be found.
Example 5:
Input: text = "ABC", pattern = ""
Output: 0
Explanation: An empty pattern is considered to be found at the beginning of any string.
Constraints
- The length of the text string (
text) will be between 0 and 100,000 characters (inclusive). - The length of the pattern string (
pattern) will be between 0 and 10,000 characters (inclusive). - Both
textandpatternwill consist of ASCII characters. - The algorithm should have a time complexity of O(n + m) in the average case, where n is the length of the text and m is the length of the pattern. While worst-case performance is possible due to collisions, your implementation should be optimized to minimize collisions.
Notes
- Consider using a prime number (e.g., 101, 31, 13) for the modulus in your hash function.
- A base of 256 is suitable for ASCII characters.
- Remember to handle collisions by comparing the actual characters when hash values match.
- The rolling hash function is the key to the efficiency of the Rabin-Karp algorithm. Think carefully about how to update the hash value as you slide the window.
- Focus on clarity and readability in your code. Good variable names and comments will be helpful.
- Test your code thoroughly with various inputs, including edge cases.