Python Programming in Context: March 2009

Wednesday, March 18, 2009

Chapter 5 -- Page 162 -- Table 5.3

The fifth example in Table 5.3 should read %20.2f to be consistent with the description.
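As a quick check of what the corrected format does, here is a minimal sketch using the string formatting operator. The value 3.14159 is an arbitrary example and is not taken from the table.

```python
# %20.2f formats a float right-justified in a field 20 characters wide,
# with 2 digits after the decimal point.
value = 3.14159              # arbitrary example value
formatted = '%20.2f' % value

print(repr(formatted))       # the quotes make the leading padding visible
print(len(formatted))        # total width of the field
```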

Chapter 5 -- Listing 5.6

Now that we understand how to decode bytes into strings, we only need to make one small change to Listing 5.6: decode each line before splitting it.


import urllib.request

def stockCorrelate(ticker1, ticker2):
    url1 = urllib.request.urlopen('http://ichart.finance.yahoo.com/table.csv?s=%s' % ticker1)
    url2 = urllib.request.urlopen('http://ichart.finance.yahoo.com/table.csv?s=%s' % ticker2)
    t1Data = url1.readlines()
    t2Data = url2.readlines()
    # Skip the header line, then decode each bytes line before splitting it.
    t1Data = [line.decode('ascii').split(',') for line in t1Data[1:]]
    t2Data = [line.decode('ascii').split(',') for line in t2Data[1:]]
    t1Close = []
    t2Close = []
    for i in range(min(len(t1Data), len(t2Data))):
        # Keep a pair of closing prices only when the dates (column 0) match.
        if t1Data[i][0] == t2Data[i][0]:
            t1Close.append(float(t1Data[i][4]))
            t2Close.append(float(t2Data[i][4]))

    print(len(t1Close), len(t2Close))
    # correlation is not shown in this post; it must be defined elsewhere.
    return correlation(t1Close, t2Close)
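
Because the listing depends on a live web service, it can help to try the decoding and date-matching steps on a few canned lines first. The sketch below uses rows copied from the sessions in this post; the helper name closingPrices is made up for illustration and is not part of the book's code.

```python
def closingPrices(rawLines1, rawLines2):
    """Return two lists of closing prices for rows whose dates match."""
    # Skip the header row, then decode each bytes line and split on commas.
    rows1 = [line.decode('ascii').split(',') for line in rawLines1[1:]]
    rows2 = [line.decode('ascii').split(',') for line in rawLines2[1:]]
    close1 = []
    close2 = []
    # Compare rows position by position, as the listing does.
    for row1, row2 in zip(rows1, rows2):
        if row1[0] == row2[0]:              # column 0 is the date
            close1.append(float(row1[4]))   # column 4 is the close
            close2.append(float(row2[4]))
    return close1, close2

# Sample rows copied from the AAPL and TGT sessions in this post.
aapl = [b'Date,Open,High,Low,Close,Volume,Adj Close\n',
        b'2009-03-17,95.24,99.69,95.07,99.66,28094500,99.66\n',
        b'2009-03-16,96.53,97.39,94.18,95.42,28473000,95.42\n',
        b'2009-03-13,96.30,97.20,95.01,95.93,21470300,95.93\n']
tgt = [b'Date,Open,High,Low,Close,Volume,Adj Close\n',
       b'2009-03-17,29.44,30.45,29.13,30.45,10204800,30.45\n',
       b'2009-03-16,30.30,30.44,28.75,28.83,11136300,28.83\n',
       b'2009-03-13,28.69,30.01,28.18,29.97,16870100,29.97\n']

aaplClose, tgtClose = closingPrices(aapl, tgt)
print(aaplClose)
print(tgtClose)
```

The two resulting lists have the same length, which is what a correlation function needs as input.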

Chapter 5 -- Session 5.9

In this post we get into a few more of the difficulties introduced by the new urllib. But first, here is a little background. In Python 3.0, strings are Unicode text. Unicode can represent the characters of many languages, even languages with thousands of symbols in their written form. Prior to 3.0, Python's ordinary strings were sequences of single bytes, and ASCII, the American Standard Code for Information Interchange, was the default encoding.

The difference between ASCII and Unicode that matters here is the number of bytes required to store a symbol. We're not going to get into the details; if you are interested, you can search for it and spend a day reading up on Unicode. Suffice it to say that a Unicode encoding may take up to 4 bytes to store a single symbol, whereas ASCII always uses just a single byte. Now, when we read data over the Internet, we get a result that is stored as a sequence of bytes. If the characters coming from the website are encoded using ASCII, everything looks normal when the bytes are displayed. In our examples, the readable output suggests that the page uses ASCII or an encoding compatible with it.
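
To make the size difference concrete, the sketch below encodes a few characters with UTF-8, one common Unicode encoding, and counts the resulting bytes. The choice of UTF-8 and of the sample characters is ours; other Unicode encodings use different sizes.

```python
# Four sample characters: a plain letter, an accented letter,
# the euro sign, and an emoji (written as escapes to stay ASCII-safe).
samples = ['A', '\u00e9', '\u20ac', '\U0001F600']

for ch in samples:
    encoded = ch.encode('utf-8')
    # ascii() keeps the output printable on any console.
    print(ascii(ch), len(encoded), encoded)

# Number of bytes UTF-8 needs for each sample character.
sizes = [len(ch.encode('utf-8')) for ch in samples]
print(sizes)
```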

In fact, we can tell Python to decode the bytes into a string by using the decode method. The decode method takes a single parameter that names the encoding used to produce the bytes in the first place. So let's return to our original test page.


>>> page = urllib.request.urlopen('http://www.cs.luther.edu/python/test.html')
>>> pageBytes = page.read()
>>> pageText = pageBytes.decode('ascii')
>>> pageText
'\n\n\n\t\n\tTest Page\n\t\n\t\n\t\n\n\nHello Python Programmer!\nThis is a test page for the urllib2 module program\n\n\n'
>>>

Now you can see that we have decoded the data from the website into a string that is convenient for us to work with.

Now the question is: how do we know what encoding is used for a particular website? We can often guess by looking at the bytes, but Python can also tell us. Each object returned by urlopen has an attribute called headers, which contains information about the response sent by the web server. One of the things stored in the headers is how the data was encoded. The example below shows us that the test webpage was encoded as 'iso-8859-1'. ISO-8859-1 is an extension of ASCII that includes characters for Western European languages. It is fully compatible with ASCII, and each character is stored as a single byte. So we could decode the page using decode('iso-8859-1') as well.


>>> page.headers.get_content_charset()
'iso-8859-1'
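
The choice of encoding matters as soon as the data contains a byte outside the ASCII range. As a small illustration with made-up data, the byte 0xE9 cannot be decoded as ASCII, while ISO-8859-1 maps it to a character.

```python
raw = b'caf\xe9'    # made-up sample; 0xE9 is not a valid ASCII byte

try:
    text = raw.decode('ascii')
except UnicodeDecodeError as err:
    text = None
    print('ascii failed:', err)

latin = raw.decode('iso-8859-1')    # 0xE9 is an accented e in ISO-8859-1
print(ascii(latin))

# Pure ASCII bytes decode the same way under either encoding.
plainBytes = b'Hello Python Programmer!'
same = plainBytes.decode('ascii') == plainBytes.decode('iso-8859-1')
print(same)
```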

To fully automate the decoding process we can simply do:


encoding = page.headers.get_content_charset()
pageText = pageBytes.decode(encoding)
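
One caveat: get_content_charset returns None when the server does not declare a charset, and decode does not accept None as an encoding. A minimal sketch of a fallback follows. The helper name decodeBody and the default of 'utf-8' are assumptions for illustration, and the email.message.Message objects stand in for page.headers so that the example runs without a network connection.

```python
from email.message import Message

def decodeBody(pageBytes, headers, default='utf-8'):
    """Decode pageBytes using the charset in headers, or default if none."""
    encoding = headers.get_content_charset() or default
    return pageBytes.decode(encoding)

# Stand-ins for page.headers (hypothetical values).
declared = Message()
declared['Content-Type'] = 'text/html; charset=iso-8859-1'

undeclared = Message()
undeclared['Content-Type'] = 'text/html'

text1 = decodeBody(b'caf\xe9', declared)          # decoded as iso-8859-1
text2 = decodeBody(b'caf\xc3\xa9', undeclared)    # falls back to utf-8
print(ascii(text1), ascii(text2))
```

Whether 'utf-8' is a sensible default depends on the sites you read from; treat it as a starting point rather than a rule.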

Finally, to return to Session 5.9, we can use the decode method on each bytes object in our list of lines.


>>> url1 = urllib.request.urlopen('http://ichart.finance.yahoo.com/table.csv?s=AAPL')
>>> t1Data = url1.readlines()
>>> t1Data[0].decode('ascii').split(',')
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close\n']
>>> t1Data = [line.decode('ascii').split(',') for line in t1Data[1:]]
>>> t1Data[:3]
[['2009-03-17', '95.24', '99.69', '95.07', '99.66',
 '28094500', '99.66\n'], ['2009-03-16', '96.53', '97.39',
 '94.18', '95.42', '28473000', '95.42\n'], ['2009-03-13',
 '96.30', '97.20', '95.01', '95.93', '21470300', '95.93\n']]

Chapter 5 -- Session 5.8

For completeness, Session 5.8 should look like this:


>>> u = urllib.request.urlopen('http://ichart.finance.yahoo.com/table.csv?s=TGT')
>>> u.readlines()[:10]
[b'Date,Open,High,Low,Close,Volume,Adj Close\n',
 b'2009-03-17,29.44,30.45,29.13,30.45,10204800,30.45\n',
 b'2009-03-16,30.30,30.44,28.75,28.83,11136300,28.83\n',
 b'2009-03-13,28.69,30.01,28.18,29.97,16870100,29.97\n',
 b'2009-03-12,26.89,28.69,26.60,28.49,12663900,28.49\n',
 b'2009-03-11,27.25,27.85,26.76,26.90,15474000,26.90\n',
 b'2009-03-10,25.75,27.66,25.46,27.21,13459800,27.21\n',
 b'2009-03-09,25.40,26.35,25.13,25.37,11816500,25.37\n',
 b'2009-03-06,26.63,26.79,25.00,25.65,12434600,25.65\n',
 b'2009-03-05,26.88,27.82,26.04,26.31,13311700,26.31\n']

Chapter 5 -- Session 5.7

One big change in Python 3.0 that we did not notice until after the book had gone to press was a change to the module for reading from the Internet. The urllib module was changed substantially.

As you can see from the session below, the urlopen function is no longer a part of urllib; it is now a part of urllib.request.



>>> import urllib
>>> page = urllib.request.urlopen('http://www.cs.luther.edu/python/test.html')
Traceback (most recent call last):
  File "", line 1, in 
    page = urllib.request.urlopen('http://www.cs.luther.edu/python/test.html')
AttributeError: 'module' object has no attribute 'request'
>>> import urllib.request
>>> page = urllib.request.urlopen('http://www.cs.luther.edu/python/test.html')
>>> pageText = page.read()
>>> pageText
b'\n\n\n\t\n\tTest Page\n\t\n\t\n\t\n\n\nHello Python Programmer!\nThis is a test page for the urllib2 module program\n\n\n'
>>> type(pageText)
<class 'bytes'>

If moving the urlopen function to urllib.request were the only change, that would not have been too bad. The more difficult change is the very subtle addition of the b before the quotes in the value of pageText. In fact, you can see that the variable pageText refers to an object of type bytes.
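
The bytes result can be reproduced without a network connection. The sketch below assumes a Python version whose urlopen accepts data: URLs (3.4 or later, not the 3.0 release discussed here); the URL and its content are made up for illustration.

```python
import urllib.request

# A data: URL carries its content inline, so no server is contacted.
url = 'data:text/plain;charset=iso-8859-1,Hello%20Python%20Programmer!'
page = urllib.request.urlopen(url)
pageBytes = page.read()

print(type(pageBytes))    # read() returns bytes, not str
print(pageBytes)

# Decoding turns the bytes into an ordinary string.
pageText = pageBytes.decode('ascii')
print(type(pageText))
print(pageText)
```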

The good news is that bytes objects act very similarly to strings. The bad news is that you cannot simply mix and match strings with bytes.

The session below illustrates the difficulty:


>>> 'foo' + b'bar'
Traceback (most recent call last):
  File "", line 1, in 
    'foo' + b'bar'
TypeError: Can't convert 'bytes' object to str implicitly
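
One way around the error is to convert one operand so that both sides have the same type. A minimal sketch, assuming ASCII is the right encoding for the data:

```python
# Convert the bytes to a string, then concatenate two strings.
asText = 'foo' + b'bar'.decode('ascii')

# Or convert the string to bytes, then concatenate two bytes objects.
asBytes = 'foo'.encode('ascii') + b'bar'

print(asText)
print(asBytes)
```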

We will work through these differences in subsequent posts about the rest of chapter 5.

Tuesday, March 17, 2009

Chapter 4 -- Page 151 -- Problem 4.46

We felt that problem 4.46 deserved a bit more detail so we added it as programming exercise 4.2. Unfortunately, we forgot to delete 4.46. To understand more about regression lines, refer to programming exercise 4.2.

Tuesday, March 3, 2009

Chapter 3 -- Listing 3.12 Vigenère cipher

Line 5 of Listing 3.12 contains the statement charNum = 0. The variable charNum is not used anywhere in the encryptVignere function and the line is not needed.

Line 2 is also extraneous. It contains the template docstring for the encryptVignere function.
