Asynchronous Web Scraper with Async/Await
This challenge focuses on implementing asynchronous operations in Python using async and await to efficiently scrape data from multiple websites concurrently. Asynchronous programming is crucial for I/O-bound tasks like web scraping, as it allows your program to continue processing while waiting for network responses, significantly improving performance. You'll build a simple scraper that fetches the titles of several web pages.
Problem Description
You are tasked with creating an asynchronous web scraper that fetches the titles of a list of URLs. The scraper should use the aiohttp library for making asynchronous HTTP requests and async/await to handle the requests concurrently. The function should take a list of URLs as input and return a list of titles corresponding to each URL. If a request fails for a particular URL, the function should gracefully handle the error and return "Error fetching title" for that URL in the output list.
Key Requirements:
- Use `aiohttp` for asynchronous HTTP requests.
- Implement `async` and `await` correctly to handle asynchronous operations.
- Handle potential errors during the request process (e.g., network errors, invalid URLs).
- Return a list of titles in the same order as the input URLs.
- If a URL cannot be fetched, return "Error fetching title" for that URL.
Expected Behavior:
The function should take a list of URLs as input. It should then concurrently fetch the HTML content of each URL and extract the title from the HTML. The function should return a list containing the titles of the fetched pages, in the same order as the input URLs. Error handling should ensure that the program doesn't crash if a URL is unreachable or invalid.
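The ordering guarantee above comes almost for free from `asyncio.gather`, which returns results in the order the coroutines were passed in, not the order they finish. A minimal sketch of that behavior, using a placeholder coroutine in place of a real HTTP request (the names `fake_fetch` and `fetch_all` are illustrative, not part of the required solution):

```python
import asyncio

async def fake_fetch(url: str, delay: float) -> str:
    # Stand-in for a real HTTP request: sleep, then return a fake title.
    await asyncio.sleep(delay)
    return f"Title of {url}"

async def fetch_all(urls: list[str]) -> list[str]:
    # asyncio.gather runs the coroutines concurrently and returns their
    # results in the order they were passed in, regardless of which
    # finishes first ("b" finishes first here, but stays second).
    delays = [0.03, 0.01, 0.02]
    return await asyncio.gather(*(fake_fetch(u, d) for u, d in zip(urls, delays)))

print(asyncio.run(fetch_all(["a", "b", "c"])))
# → ['Title of a', 'Title of b', 'Title of c']
```

The total runtime is roughly the longest single delay, not the sum, which is the concurrency win this challenge is after.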
Edge Cases to Consider:
- Invalid URLs (e.g., malformed URLs).
- Network errors (e.g., connection timeouts, DNS resolution failures).
- Websites that don't return a standard HTML title tag.
- Empty input list.
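The "no standard title tag" edge case can be handled in the parsing step. One possible approach is a small regex helper (the name `extract_title` is illustrative; the regex tolerates attributes on the tag and multi-line titles, and falls back to the required error string):

```python
import re

ERROR_TITLE = "Error fetching title"

def extract_title(html: str) -> str:
    # Grab the contents of the first <title> tag; IGNORECASE tolerates
    # <TITLE>, DOTALL lets the title span multiple lines.
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if match is None:
        # Page has no standard title tag -- one of the listed edge cases.
        return ERROR_TITLE
    return match.group(1).strip()

print(extract_title("<html><head><title>Example Domain</title></head></html>"))
# → Example Domain
print(extract_title("<p>no title here</p>"))
# → Error fetching title
```

A proper HTML parser such as BeautifulSoup is more robust, but as the Notes below point out, the parsing method is not the focus of this challenge.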
Examples
Example 1:
Input: ["https://www.example.com", "https://www.python.org", "https://www.google.com"]
Output: ["Example Domain", "Welcome to Python.org", "Google"]
Explanation: The function successfully fetches the titles from each website and returns them in a list.
Example 2:
Input: ["https://www.example.com", "https://invalid-url", "https://www.python.org"]
Output: ["Example Domain", "Error fetching title", "Welcome to Python.org"]
Explanation: The function fetches the title from example.com and python.org, but fails to fetch from the invalid URL, returning the error message.
Example 3:
Input: []
Output: []
Explanation: An empty input list results in an empty output list.
Constraints
- The input list of URLs will contain strings.
- URLs may be malformed or unreachable; your function must handle these without raising (see Example 2).
- The function should complete within 5 seconds for a list of 10 URLs.
- You must use `aiohttp` for making HTTP requests.
- The function should be asynchronous (defined with `async def`).
Notes
- You'll need to install `aiohttp`: `pip install aiohttp`
- Consider using `try...except` blocks to handle potential errors during the request process.
- The `BeautifulSoup` library can be helpful for parsing HTML and extracting the title tag, but it's not strictly required. You can use regular expressions or other string manipulation techniques if you prefer.
- Focus on correctly implementing `async` and `await` to achieve concurrency. The specific HTML parsing method is less important.
- Remember to use `asyncio.run()` to execute the asynchronous function.
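Putting the notes together, one possible shape for a solution (a sketch, assuming `aiohttp` is installed; the regex-based `extract_title` helper, the per-request 5-second `ClientTimeout`, and the broad `except` are illustrative choices, not requirements of the spec):

```python
import asyncio
import re

import aiohttp

def extract_title(html: str) -> str:
    # Fall back to the required error string when no <title> tag exists.
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else "Error fetching title"

async def fetch_title(session: aiohttp.ClientSession, url: str) -> str:
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
            resp.raise_for_status()
            return extract_title(await resp.text())
    except Exception:
        # Covers malformed URLs, DNS failures, timeouts, and HTTP errors.
        return "Error fetching title"

async def fetch_titles(urls: list[str]) -> list[str]:
    # One shared session for all requests; gather preserves input order,
    # so the empty-list case naturally yields [].
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_title(session, url) for url in urls))

# Usage (performs real network requests):
# print(asyncio.run(fetch_titles(["https://www.example.com"])))
```

Because each failure is caught inside `fetch_title`, one bad URL cannot crash the whole batch, and the error sentinel lands at the right position in the output list.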