Thursday, June 4, 2015

My First Slurp of Beautiful Soup

I have seen the movie Mary Poppins many times in my life. Her magical bag held lamps and all sorts of ginormous objects. Her snapping fingers cleaned a room. Her umbrella helped her fly. She was magical.

And you know what else is magical? Beautiful Soup!

As Mary Poppins might say, a spoonful of Beautiful Soup makes the medicine go down...

The medicine, in this case, is the task of parsing HTML. Beautiful Soup will rescue you from the horrors of regex and help you navigate the DOM like a classy person.

I used Goodreads' Most Read Books in the U.S. as a platform for my first soupy adventure.

I wanted to scrape the book URLs and book titles from all 50 books on the list.

A quick look at the page source reveals that each book has an anchor (<a>) tag with a class of "bookTitle".

Recipe for Beautiful Soup
1. Install Beautiful Soup with pip install beautifulsoup4, then import it at the top of your Python file: from bs4 import BeautifulSoup

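2. Create the soup from the downloaded page (page here is the HTTP response, for example one returned by requests.get), naming the parser explicitly.
    book_soup = BeautifulSoup(page.content, "html.parser")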

3. Generate a list of every <a> element with the class "bookTitle".
    book_links = book_soup.find_all("a", class_="bookTitle")

4. Use a list comprehension to compose a list of book titles from the aforementioned list.
    popular_book_titles = [book.find('span').get_text() for book in book_links]

5. Use a list comprehension to compose a list of book URLs, where url is the base address prepended to each href.
    popular_book_urls = [url + link.get('href') for link in book_links]

6. Ladle out the results.
    return (popular_book_titles, popular_book_urls)
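
Here is the whole recipe in one pot, as a minimal sketch rather than my exact script. So it runs without touching the network, it parses a small made-up HTML snippet instead of the real Goodreads page. The names SAMPLE_HTML, BASE_URL and scrape_popular_books are hypothetical, the base address is my assumption for what url held in step 5, and urljoin is swapped in for plain string concatenation because it copes with both relative and absolute hrefs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real Goodreads page.
SAMPLE_HTML = """
<table>
  <tr><td><a class="bookTitle" href="/book/show/1"><span>First Book</span></a></td></tr>
  <tr><td><a class="bookTitle" href="/book/show/2"><span>Second Book</span></a></td></tr>
  <tr><td><a class="authorName" href="/author/show/9"><span>Some Author</span></a></td></tr>
</table>
"""

# Assumed value of `url` from step 5; not confirmed by the original script.
BASE_URL = "https://www.goodreads.com"


def scrape_popular_books(html, base_url=BASE_URL):
    """Return (titles, urls) for every <a class="bookTitle"> found in html."""
    # Step 2: create the soup, naming the parser explicitly.
    book_soup = BeautifulSoup(html, "html.parser")

    # Step 3: collect every <a> element carrying the bookTitle class.
    book_links = book_soup.find_all("a", class_="bookTitle")

    # Step 4: the title text lives in the <span> inside each link.
    titles = [book.find("span").get_text(strip=True) for book in book_links]

    # Step 5: turn each relative href into an absolute URL.
    urls = [urljoin(base_url, link.get("href")) for link in book_links]

    # Step 6: ladle out the results.
    return titles, urls


titles, urls = scrape_popular_books(SAMPLE_HTML)
for title, link in zip(titles, urls):
    print(title, link)
```

To point this at a live page, you would pass in the downloaded HTML (for example the content of a requests response) in place of SAMPLE_HTML. The real markup may differ from my snippet, so check the page source first.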
