Wednesday, March 18, 2009

Chapter 5 -- Session 5.9

In this post we get into a few more of the difficulties introduced by the new urllib module. But first, a little background. In Python 3.0, strings are Unicode. Unicode is a standard capable of representing the characters of many languages, even languages with thousands of symbols in their writing systems. Prior to 3.0, Python strings defaulted to ASCII, the American Standard Code for Information Interchange.

The difference between ASCII and Unicode involves the number of bytes required to store a symbol in memory. We're not going to get into the details here; if you are interested you can google it and spend a day reading up on Unicode. Suffice it to say that a Unicode character may take up to 4 bytes to store, whereas ASCII always uses exactly one byte per character. Now, when we read data over the internet we get the result back as a bytes object. If the characters coming from the website are encoded using ASCII, everything looks normal, and in our examples everything does look normal, so the test page must be using an ASCII-compatible encoding.
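To make the size difference concrete, here is a small sketch; the characters used are just illustrative examples, not from the chapter:

```python
# ASCII characters always fit in one byte.
print(len("A".encode("ascii")))       # 1 byte

# The same character can need more bytes in a Unicode encoding.
print(len("é".encode("utf-8")))       # 2 bytes in UTF-8
print(len("漢".encode("utf-8")))      # 3 bytes in UTF-8

# UTF-32 (little-endian, no byte-order mark) uses 4 bytes per character,
# which is where the "up to 4 bytes" figure comes from.
print(len("漢".encode("utf-32-le")))  # 4 bytes
```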

In fact we can tell Python to decode the bytes into a string by using the decode method. The decode method takes a single parameter that names the encoding that was used to produce the bytes in the first place. So, to return to our original test page:


>>> page = urllib.request.urlopen('http://www.cs.luther.edu/python/test.html')
>>> pageBytes = page.read()
>>> pageText = pageBytes.decode('ascii')
>>> pageText
'\n\n\n\t\n\tTest Page\n\t\n\t\n\t\n\n\nHello Python Programmer!\n\nThis is a test page for the urllib2 module program\n\n\n'
>>>

Now you can see that we have decoded the data from the website into a string that is convenient for us to work with.
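It's also worth seeing what happens when the bytes don't match the encoding we ask for. A minimal sketch (the string 'café' is just an illustration, not part of the test page):

```python
# A hypothetical response body containing a non-ASCII character.
raw = "café".encode("utf-8")   # b'caf\xc3\xa9'

# Decoding with the wrong encoding raises UnicodeDecodeError.
try:
    raw.decode("ascii")
except UnicodeDecodeError as err:
    print("not ASCII:", err.reason)

# Decoding with the right encoding works as expected.
print(raw.decode("utf-8"))     # prints: café
```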

Now the question is: how do we know what encoding a particular website uses? We can tell by looking, but Python can also tell us. Each object created by urlopen has an attribute called headers. The headers attribute contains metadata about the response from the web server, including the character set the data was encoded with. The example below shows that the test webpage was encoded as 'iso-8859-1'. ISO-8859-1 is an extension of ASCII that adds characters for the standard Western European languages. It is fully compatible with ASCII, and each character is still stored as a single byte. So we could decode the page using decode('iso-8859-1') as well.


>>> page.headers.get_content_charset()
'iso-8859-1'


To fully automate the decoding process we can simply do:

encoding = page.headers.get_content_charset()
pageText = pageBytes.decode(encoding)
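One caveat worth noting (an assumption on my part, not something the chapter covers): get_content_charset() returns None when the server does not declare a charset, so a small wrapper with a fallback is safer. The decode_bytes name and the 'iso-8859-1' fallback are my own choices:

```python
def decode_bytes(page_bytes, encoding):
    # Fall back to 'iso-8859-1' (a single-byte encoding that never fails
    # to decode) when the server did not declare a charset.
    if encoding is None:
        encoding = "iso-8859-1"
    return page_bytes.decode(encoding)

# With a live response this would be called as:
#   pageText = decode_bytes(page.read(), page.headers.get_content_charset())
```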


Finally, to return to Session 5.9, we can use the decode method on each bytes object in our list of lines.


>>> url1 = urllib.request.urlopen('http://ichart.finance.yahoo.com/table.csv?s=AAPL')
>>> t1Data = url1.readlines()
>>> t1Data[0].decode('ascii').split(',')
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close\n']
>>> t1Data = [line.decode('ascii').split(',') for line in t1Data[1:]]
>>> t1Data[:3]
[['2009-03-17', '95.24', '99.69', '95.07', '99.66',
'28094500', '99.66\n'], ['2009-03-16', '96.53', '97.39',
'94.18', '95.42', '28473000', '95.42\n'], ['2009-03-13',
'96.30', '97.20', '95.01', '95.93', '21470300', '95.93\n']]
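As an aside, the standard library's csv module can do the splitting for us and also strips the trailing newline. A sketch using a couple of made-up rows that mirror the Yahoo data (the real code would iterate over url1.readlines() instead of this sample list):

```python
import csv

# Sample byte lines standing in for url1.readlines().
raw_lines = [
    b"Date,Open,High,Low,Close,Volume,Adj Close\n",
    b"2009-03-17,95.24,99.69,95.07,99.66,28094500,99.66\n",
]

# csv.reader accepts any iterable of strings, so we decode as we go.
reader = csv.reader(line.decode("ascii") for line in raw_lines)
header = next(reader)
rows = list(reader)
print(header)   # no '\n' left on the last field
print(rows[0])
```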


