Thursday, June 4, 2015

My First Slurp of Beautiful Soup

I have seen the movie Mary Poppins many times in my life. Her magical bag held lamps and all sorts of ginormous objects. Her snapping fingers cleaned a room. Her umbrella helped her fly. She was magical.

And you know what else is magical? Beautiful Soup!

As Mary Poppins might say, a spoonful of Beautiful Soup makes the medicine go down...

The medicine, in this case, is the task of parsing HTML. Beautiful Soup will rescue you from the horrors of regex and help you navigate the DOM like a classy person.

I used Goodreads' Most Read Books in the U.S. as a platform for my first soupy adventure.

I wanted to scrape the book urls and book titles from all 50 books on the list.

A quick look at the source code reveals that each book has a link tag with a class of "bookTitle".

Recipe for Beautiful Soup
1. To use Beautiful Soup, simply install it with pip install beautifulsoup4 and then import it at the top of your Python file: from bs4 import BeautifulSoup

2. Create the soup.
book_soup = BeautifulSoup(page.content, "html.parser")

3. Generate a list of all segments of the DOM that start with: <a class="bookTitle"
book_links = book_soup.find_all("a", class_="bookTitle")

4. Use list comprehension to compose a list of book titles from the aforementioned list
    popular_book_titles = [book.find('span').get_text() for book in book_links]

5. Use list comprehension to compose a list of book links.
    popular_book_urls = [url + link.get('href') for link in book_links]

6. Ladle out the results.
    return (popular_book_titles, popular_book_urls)
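The six steps above can be sketched as one small function. This is a minimal sketch: the hard-coded HTML snippet below is a hypothetical stand-in for the real page content (in the actual script, page.content would come from fetching the Goodreads list), trimmed to just the structure being scraped.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Goodreads "Most Read" page content.
# In the real script, this would be page.content from an HTTP request.
SAMPLE_HTML = """
<html><body>
<a class="bookTitle" href="/book/show/1"><span>Book One</span></a>
<a class="bookTitle" href="/book/show/2"><span>Book Two</span></a>
</body></html>
"""

def scrape_books(html, url="https://www.goodreads.com"):
    """Return (titles, urls) for every bookTitle link on the page."""
    # Step 2: create the soup
    book_soup = BeautifulSoup(html, "html.parser")
    # Step 3: every <a class="bookTitle"> segment of the DOM
    book_links = book_soup.find_all("a", class_="bookTitle")
    # Step 4: the title lives in a <span> inside each link
    popular_book_titles = [book.find("span").get_text() for book in book_links]
    # Step 5: the href is relative, so prepend the site url
    popular_book_urls = [url + link.get("href") for link in book_links]
    # Step 6: ladle out the results
    return (popular_book_titles, popular_book_urls)

titles, urls = scrape_books(SAMPLE_HTML)
```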

Tuesday, June 2, 2015

Goodreads API + Python = An Adventure

I've been playing around with the Goodreads API for about a week. Here are some things I learned the hard way:

-Use rauth. I tried using oauth2 but kept getting an "Invalid OAuth Request" message. Then I realized it was a problem with the library. I had somehow missed this sentence in the API documentation: "We recommend using rauth instead of oauth2 since oauth2 hasn't been updated in quite a while."

-According to a developer message board on Goodreads, the Goodreads API only supports OAuth1 (for the indefinite future), so don't accidentally start using/reading the OAuth2Service or OAuth2Session sections of the rauth docs.

-Look at this. Very, very carefully. It will help you set up your request token so that your app's user can grant you access to make changes to the user's Goodreads account.

The examples in the aforementioned resource covered setting up an OAuth1Session and adding a book to a user's shelf. Both are very important, but they show nothing about how to GET information from a user's account rather than POST information to it.

Getting the user ID proved to be a challenge. Once a user has granted you access through rauth, the Goodreads API documentation says:

Basically, this explains that the endpoint returns an xml response containing the Goodreads user_id. However, the response will consist of more than just the user_id...you will have to do a bit of digging to find that exact piece of information.

I created a function called get_user_id() that uses the GET url to fetch the xml object:
user_stuff = session.get('/api/auth_user.xml')

Then use parseString (make sure you include this import statement at the top: from xml.dom.minidom import parseString) to parse the xml.

The important part is that you need to parse the xml content, not the xml object. 
xml_stuff = parseString(user_stuff.content) 

Now you're ready to getElementsByTagName. Thanks to Steve Kertes's code on GitHub, I was able to figure out that user_id = xml_stuff.getElementsByTagName('user')[0].attributes['id'].value

Now just return the str(user_id) and you're done! High five yourself. One small step for API experts, one giant leap for me. 
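Putting those XML steps together, here's a minimal sketch of get_user_id(). The sample response below is a hypothetical stand-in for what session.get('/api/auth_user.xml') returns, trimmed to just the structure being parsed.

```python
from xml.dom.minidom import parseString

# Hypothetical stand-in for session.get('/api/auth_user.xml').content
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<GoodreadsResponse>
  <user id="12345">
    <name>Example Reader</name>
  </user>
</GoodreadsResponse>
"""

def get_user_id(xml_content):
    """Dig the Goodreads user id out of the auth_user.xml response."""
    # Parse the xml content (the bytes), not the response object itself
    xml_stuff = parseString(xml_content)
    # The id lives as an attribute on the <user> element
    user_id = xml_stuff.getElementsByTagName('user')[0].attributes['id'].value
    return str(user_id)

print(get_user_id(SAMPLE_RESPONSE))  # → 12345
```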

I owe a million thanks to this code, this, and this. Where would I be without the Internet???