Saturday, October 24, 2015

Using Selenium + Python for Scraping Sites with Usernames

I've been playing around with scraping data from a website that requires a username and password to view information related to my profile.

Until recently, I was able to log in by using code like the following:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    # Placeholders: fill in your own login page link and credentials
    link = "insert login page link here"
    myuser = "your username"
    mypass = "your password"

    driver = webdriver.Chrome()

    # log in
    driver.get(link)
    userbox = driver.find_element(By.ID, "Username")
    userbox.send_keys(myuser)
    passbox = driver.find_element(By.ID, "Password")
    passbox.send_keys(mypass)
    passbox.send_keys(Keys.RETURN)  # pressing Return submits the form

However, the site recently added a CAPTCHA to its login process, so the above code no longer works.




Firefox Profiles to the rescue!

In the course of my research, I learned that Selenium pros tend to prefer using custom profiles for faster page loads anyway, so maybe this was a blessing in disguise. Plus, I learned something new!

How to Bypass a CAPTCHA/Log-in Page With Selenium WebDriver

First, create a Firefox profile. 
What's a Firefox profile, you ask? Mozilla says: "Firefox saves your personal information such as bookmarks, passwords, and user preferences in a set of files called your profile, which is stored in a separate location from the Firefox program files."

You have a default profile already, but let's create one just in case you want to test Selenium with different settings than you normally use.

In Terminal, run /Applications/Firefox.app/Contents/MacOS/firefox-bin -P

This opens a window where you can create a new profile and launch it in a clean instance of the Firefox app. The new instance appears in your Dock as a separate icon, after your other app icons. In that Firefox window, log in to the site you want to test on. The site will set a cookie that keeps you logged in, so the next time you visit with Selenium using this profile you can skip the login page and the CAPTCHA.

Reminder: Log in on Firefox with this special profile* from time to time (by running the Terminal command mentioned above). Cookies expire, so Selenium won't be able to get past the login page if you haven't logged in recently enough.

*You can also check the box that says "use the selected profile without asking at startup" if you want to just use this profile all the time, not just for Selenium stuff.
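
Since cookies expire, it can help to have your script check whether it actually got past the login page. Here's a minimal sketch of one way to do that. It assumes the site shows the same "Username" box from my old login code whenever you are not logged in, which may not be true for your site. The function name needs_login is just something I made up for this example.

```python
def needs_login(driver, field_id="Username"):
    """Return True if the login form's username box is on the current page."""
    # find_elements returns an empty list (instead of raising) when nothing matches.
    # "id" is the locator strategy string that Selenium's By.ID constant stands for.
    return len(driver.find_elements("id", field_id)) > 0


# Usage, with a driver that was started on the saved Firefox profile:
#     driver.get(link)
#     if needs_login(driver):
#         print("Cookie expired: log in again in Firefox with the Selenium profile.")
```

If the check comes back True, open Firefox with the profile, log in by hand, and run the script again.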

Update Your Code
It's time to update your code to use this Firefox profile. The path you need depends on where you run your Python file from. The example below assumes your code lives in a folder two levels below your home folder, which is the folder that contains Library. Don't worry, the space between "Application" and "Support" is not a problem. You'll need to personalize the path and replace yourprofilenamehere with the name of your own profile folder.

    profile = webdriver.FirefoxProfile('../../Library/Application Support/Firefox/Profiles/yourprofilenamehere')
    driver = webdriver.Firefox(firefox_profile=profile)
    driver.get(link)
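
If you'd rather not count folder levels, here's a rough sketch that builds the absolute path instead. It assumes the macOS profile location used above, and that Firefox names profile folders as some random characters, a dot, and then the profile name. The helper names find_profile_dir and make_driver and the profile name "selenium" are my own inventions for this example. The make_driver part passes Firefox's -profile flag through Selenium's options object, which as far as I can tell is the route to take on newer Selenium versions that no longer accept the firefox_profile argument. I have not tested it against every version, so treat it as a starting point.

```python
from pathlib import Path


def find_profile_dir(profile_name, profiles_root=None):
    """Return the absolute path of the profile folder ending in '.<profile_name>'."""
    if profiles_root is None:
        # macOS default, the same place the relative path above points to
        profiles_root = (Path.home() / "Library" / "Application Support"
                         / "Firefox" / "Profiles")
    matches = sorted(
        p for p in Path(profiles_root).glob("*." + profile_name) if p.is_dir()
    )
    if not matches:
        raise FileNotFoundError(
            "No Firefox profile named %r in %s" % (profile_name, profiles_root)
        )
    return str(matches[0])


def make_driver(profile_dir):
    """Start Firefox on an existing profile folder (assumes a newer Selenium)."""
    from selenium import webdriver  # imported here so the helper above works without it

    options = webdriver.FirefoxOptions()
    options.add_argument("-profile")    # Firefox's own command-line flag
    options.add_argument(profile_dir)   # the folder found by find_profile_dir
    return webdriver.Firefox(options=options)


# Usage ("selenium" is a hypothetical profile name):
#     driver = make_driver(find_profile_dir("selenium"))
#     driver.get(link)
```

With the older style shown above, you can also pass the result of find_profile_dir straight to webdriver.FirefoxProfile in place of the relative path.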

Happy scraping!

