And you know what else is magical? Beautiful Soup!
As Mary Poppins might say, a spoonful of Beautiful Soup makes the medicine go down...
The medicine, in this case, is the task of parsing HTML. Beautiful Soup will rescue you from the horrors of regex and help you navigate the DOM like a classy person.
I used Goodreads' Most Read Books in the U.S. as a platform for my first soupy adventure.
I wanted to scrape the URLs and titles of all 50 books on the list.
A quick look at the page source reveals that each book has a link (<a>) tag with a class of "bookTitle".
Recipe for Beautiful Soup
1. To use Beautiful Soup, install it with pip install beautifulsoup4, then import it at the top of your Python file: from bs4 import BeautifulSoup
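2. Create the soup from the downloaded page (here, page is the HTTP response object, such as one returned by requests.get). Naming the parser explicitly keeps Beautiful Soup from guessing one and warning about it.
book_soup = BeautifulSoup(page.content, "html.parser")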
3. Generate a list of every <a> tag in the DOM that has the class "bookTitle".
book_links = book_soup.find_all("a", class_="bookTitle")
4. Use a list comprehension to build a list of book titles from that list of tags.
popular_book_titles = [book.find('span').get_text() for book in book_links]
5. Use a list comprehension to build a list of book URLs. Here url is the site's base URL, prepended because each href is a relative path.
popular_book_urls = [url + link.get('href') for link in book_links]
6. Ladle out the results.
return (popular_book_titles, popular_book_urls)
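For anyone who wants to taste the whole recipe in one bowl, here is a minimal sketch that strings the steps together. It is not the exact script from this post: SAMPLE_HTML is a made-up stand-in for the downloaded page (the real Goodreads markup may differ), and BASE_URL and scrape_books are names I'm assuming for illustration. It also swaps the plain url + href concatenation for urljoin from the standard library, which handles slashes between the base and the path for you.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr><td>
    <a class="bookTitle" href="/book/show/1-first-book">
      <span>First Book</span>
    </a>
  </td></tr>
  <tr><td>
    <a class="bookTitle" href="/book/show/2-second-book">
      <span>Second Book</span>
    </a>
  </td></tr>
</table>
"""

# Assumed base URL used to turn relative hrefs into absolute links.
BASE_URL = "https://www.goodreads.com"


def scrape_books(html, base_url=BASE_URL):
    """Return (titles, urls) for every <a class="bookTitle"> in the HTML."""
    # Create the soup, naming the parser explicitly.
    book_soup = BeautifulSoup(html, "html.parser")

    # Collect every anchor tag carrying the bookTitle class.
    book_links = book_soup.find_all("a", class_="bookTitle")

    # The title text sits in a <span> inside each anchor (as in the post).
    titles = [link.find("span").get_text(strip=True) for link in book_links]

    # Join each relative href onto the base URL.
    urls = [urljoin(base_url, link.get("href")) for link in book_links]

    return titles, urls


titles, urls = scrape_books(SAMPLE_HTML)
for title, url in zip(titles, urls):
    print(title, url)
```

To run it against the live page instead, you would pass the downloaded HTML (for example, the content of a requests response) in place of SAMPLE_HTML. Note that link.find("span") returns None for an anchor with no span inside, so a real page with different markup would need a guard there.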