
Asynchronous Web Scraping with aiohttp

This challenge focuses on building a simple asynchronous web scraper using the aiohttp library in Python. Asynchronous programming allows for efficient handling of multiple network requests concurrently, significantly improving performance when dealing with numerous web pages. You'll be tasked with fetching content from a list of URLs and extracting specific data.

Problem Description

You are required to write a Python function that takes a list of URLs as input and asynchronously fetches the HTML content of each URL using aiohttp. The function should then extract the title of each webpage and return a dictionary where the keys are the URLs and the values are the corresponding page titles. If a URL cannot be accessed (e.g., due to a network error or invalid URL), the value for that URL in the dictionary should be None. The function must use aiohttp for asynchronous requests.

Key Requirements:

  • Asynchronous Requests: Utilize aiohttp's asynchronous capabilities to fetch multiple pages concurrently.
  • Error Handling: Gracefully handle potential errors during the request process (e.g., connection errors, timeouts, invalid URLs). Set the value to None in case of an error.
  • Title Extraction: Extract the <title> tag content from the HTML of each page.
  • Return Value: Return a dictionary mapping URLs to their corresponding titles (or None if an error occurred).
  • Proper Resource Management: Ensure that the aiohttp.ClientSession is properly closed after all requests are completed.
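The resource-management requirement is usually met with an async context manager, which closes the session even if a request raises. A minimal sketch (`demo` is an illustrative name; the URL is taken from the examples below):

```python
import asyncio
import aiohttp

async def demo():
    # "async with" guarantees the session is closed even if a request fails
    async with aiohttp.ClientSession() as session:
        async with session.get("https://www.example.com") as resp:
            print(resp.status)

# asyncio.run(demo())  # requires network access
```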

Expected Behavior:

The function should:

  1. Create an aiohttp.ClientSession.
  2. Iterate through the list of URLs.
  3. For each URL, make an asynchronous GET request.
  4. If the request is successful, extract the title from the HTML content.
  5. If the request fails, set the title to None.
  6. Store the URL and title (or None) in a dictionary.
  7. Close the aiohttp.ClientSession.
  8. Return the dictionary.
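The steps above can be sketched as follows. This is one possible shape rather than a reference solution; the function names (`scrape_titles`, `fetch_title`, `extract_title`) and the regex-based title extraction are illustrative choices:

```python
import asyncio
import re
import aiohttp

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the <title> text, or None if the tag is absent."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None

async def fetch_title(session, url):
    """Fetch one URL and return its title, or None on any error."""
    try:
        async with session.get(url) as resp:
            if resp.status != 200:  # treat non-200 responses as errors
                return None
            return extract_title(await resp.text())
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return None

async def scrape_titles(urls):
    """Map each URL to its page title (None on failure)."""
    # the session is closed automatically when the block exits
    async with aiohttp.ClientSession() as session:
        titles = await asyncio.gather(*(fetch_title(session, u) for u in urls))
    return dict(zip(urls, titles))
```

Because `asyncio.gather` preserves argument order, zipping the results back onto the input URLs is safe.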

Edge Cases to Consider:

  • Invalid URLs: URLs that are malformed or do not point to valid web pages.
  • Network Errors: Temporary network issues that prevent access to a URL.
  • Timeouts: Requests that take too long to complete.
  • Empty Titles: Web pages that do not have a <title> tag.
  • Non-200 Status Codes: Responses with status codes other than 200 (OK). Treat these as errors.
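Most of these edge cases can be folded into one try/except around a single request; a sketch under stated assumptions (the helper name and the 5-second timeout are illustrative, not part of the spec):

```python
import asyncio
import aiohttp

async def fetch_html(session, url, timeout_s=5):
    """Fetch HTML for one URL; return None on any error (illustrative helper)."""
    try:
        async with session.get(
            url, timeout=aiohttp.ClientTimeout(total=timeout_s)
        ) as resp:
            if resp.status != 200:  # non-200 status codes count as errors
                return None
            return await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        # covers invalid URLs, connection errors, and timeouts
        return None
```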

Examples

Example 1:

Input: ["https://www.example.com", "https://www.python.org"]
Output: {"https://www.example.com": "Example Domain", "https://www.python.org": "Welcome to Python.org"}
Explanation: The function successfully fetches the content of both URLs and extracts their titles.

Example 2:

Input: ["https://www.example.com", "https://invalid-url.xyz"]
Output: {"https://www.example.com": "Example Domain", "https://invalid-url.xyz": None}
Explanation: The function fetches the content of example.com successfully, but fails to access the invalid URL, resulting in a `None` value for that URL.

Example 3:

Input: ["https://www.wikipedia.org"]
Output: {"https://www.wikipedia.org": "Wikipedia"}
Explanation: The function fetches wikipedia.org successfully and extracts its title. A page that genuinely lacks a <title> tag is a separate situation, covered under "Edge Cases to Consider" above.

Constraints

  • The input list of URLs will contain strings.
  • The number of URLs in the input list can range from 1 to 100.
  • The function should complete within 10 seconds for a list of 100 URLs.
  • The function should handle exceptions gracefully and not crash.
  • The aiohttp.ClientSession should be created and closed within the function.

Notes

  • Consider using asyncio.gather to concurrently fetch the content of multiple URLs.
  • Use try...except blocks to handle potential errors during the request process.
  • The BeautifulSoup library is not required for this challenge. You can use regular expressions or string manipulation to extract the title. However, using BeautifulSoup is acceptable if you prefer.
  • Focus on the asynchronous aspects of aiohttp and proper error handling.
  • Remember to await asynchronous operations.
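The asyncio.gather pattern mentioned above can be illustrated without any network access. Here `work` is a stand-in coroutine that simulates one request, and `return_exceptions=True` keeps a single failure from cancelling the rest, with failures mapped to None in the spirit of the challenge:

```python
import asyncio

async def work(n):
    # simulate an I/O-bound task; sleep stands in for a network request
    await asyncio.sleep(0.01)
    if n == 2:
        raise ValueError("simulated failure")
    return n * 10

async def main():
    # return_exceptions=True returns exceptions as results instead of raising
    results = await asyncio.gather(*(work(n) for n in range(4)),
                                   return_exceptions=True)
    # map exceptions to None, mirroring the challenge's error convention
    return [None if isinstance(r, Exception) else r for r in results]

print(asyncio.run(main()))  # → [0, 10, None, 30]
```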