The difference between ASCII and Unicode comes down to the number of bytes required to store a symbol in memory. We're not going to get into the details here; if you are interested you can google it and spend a day reading up on Unicode. Suffice it to say that a Unicode encoding such as UTF-8 may take up to four bytes to store a single symbol, whereas ASCII always uses exactly one byte. Now, when we read data over the internet we get a result that is stored as a sequence of bytes. If the characters coming from the website are encoded using ASCII, everything looks normal, and since everything in our examples does look normal, we can guess that the test page is encoded in ASCII or something compatible with it.
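A quick interpreter experiment (not part of the original example, but it makes the size difference concrete) shows that an ASCII character always encodes to one byte, while a single non-ASCII symbol encoded as UTF-8 can take several:
>>> len('abc'.encode('ascii'))    # ASCII: always one byte per character
3
>>> len('é'.encode('utf-8'))      # one symbol, two bytes
2
>>> len('€'.encode('utf-8'))      # one symbol, three bytes
3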
In fact, we can tell Python to decode the bytes into a string by using the decode method. The decode method takes a single parameter that names the encoding that was used to encode the bytes in the first place. So, to return to our original test page:
>>> page = urllib.request.urlopen('http://www.cs.luther.edu/python/test.html')
>>> pageBytes = page.read()
>>> pageText = pageBytes.decode('ascii')
>>> pageText
'<html>\n<head>\n\t<title>Test Page</title>\n</head>\n<body>\n<h1>Hello Python Programmer!</h1>\n<p>This is a test page for the urllib2 module program</p>\n</body>\n</html>\n'
>>>
Now you can see that we have decoded the data from the website into a string that is convenient for us to work with.
Now the question is: how do we know what encoding is used for a particular website? Often we can tell by looking, but Python can also tell us. Each object created by urlopen has an attribute called headers. The headers attribute contains lots of information about the data returned by the web server, including how the data was encoded. The example below shows us that the test webpage was encoded as 'iso-8859-1'. ISO-8859-1 is an extension of ASCII that adds characters for the standard Western European languages. It is backward compatible with ASCII, and each character is still stored in a single byte, so we could decode the page using decode('iso-8859-1') as well.
>>> page.headers.get_content_charset()
'iso-8859-1'
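The headers object behaves like a dictionary, so you can poke around in it for yourself. The values shown below are illustrative rather than a recorded session; the charset that get_content_charset reports is parsed out of the Content-Type header:
>>> page.headers['Content-Type']    # get_content_charset parses this header
'text/html; charset=iso-8859-1'
>>> list(page.headers.keys())[:3]   # a few of the header names (illustrative)
['Date', 'Server', 'Content-Type']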
To fully automate the decoding process, we can simply do:
encoding = page.headers.get_content_charset()
pageText = pageBytes.decode(encoding)
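One caveat worth knowing: get_content_charset returns None when the server does not declare a charset, and passing None to decode raises a TypeError. A slightly more defensive sketch (the 'utf-8' fallback here is my choice of default, not something the server tells us):
encoding = page.headers.get_content_charset()
if encoding is None:
    encoding = 'utf-8'   # fallback guess; the server sent no charset
pageText = pageBytes.decode(encoding)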
Finally, to return to Section 5.9, we can use the decode method on each bytes object in our list of lines.
>>> url1 = urllib.request.urlopen('http://ichart.finance.yahoo.com/table.csv?s=AAPL')
>>> t1Data = url1.readlines()
>>> t1Data[0].decode('ascii').split(',')
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close\n']
>>> t1Data = [line.decode('ascii').split(',') for line in t1Data[1:]]
>>> t1Data[:3]
[['2009-03-17', '95.24', '99.69', '95.07', '99.66',
'28094500', '99.66\n'], ['2009-03-16', '96.53', '97.39',
'94.18', '95.42', '28473000', '95.42\n'], ['2009-03-13',
'96.30', '97.20', '95.01', '95.93', '21470300', '95.93\n']]
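One small cleanup we might add (not part of the original session): the last field of each row still carries its trailing '\n' (look at '99.66\n' above). Stripping each line before splitting removes it. The comprehension below is a drop-in replacement for the one used above, so run it instead of, not after, the original:
>>> t1Data = [line.decode('ascii').strip().split(',') for line in t1Data[1:]]
>>> t1Data[0]
['2009-03-17', '95.24', '99.69', '95.07', '99.66', '28094500', '99.66']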