This post is also available as an IPython notebook which may be downloaded or viewed online.
The sneeze count website has catalogued one person’s sneezes since 2007. A work colleague asked how difficult it would be to extract a list of timestamps from the site. Could we then plot sneezes over time? This post shows how easy this is to do using Python.
Conveniently, there is an RSS feed available for the site which can be fetched via Python’s built-in urlopen function:
from urllib.request import urlopen
print('...')
print('\n'.join(urlopen('http://sneezecount.joyfeed.com/feed').read().decode('utf8').splitlines()[20:30]))
print('...')
...
<item>
<title>Four thousand and nineteen</title>
<link>http://sneezecount.joyfeed.com/four-thousand-and-nineteen/</link>
<comments>http://sneezecount.joyfeed.com/four-thousand-and-nineteen/#comments</comments>
<pubDate>Fri, 22 Aug 2014 12:41:25 +0000</pubDate>
<dc:creator><![CDATA[Peter]]></dc:creator>
<category><![CDATA[Sneezes]]></category>
<guid isPermaLink="false">http://sneezecount.joyfeed.com/?p=8608</guid>
<description><![CDATA[Seesaw, Twycross Zoo Moderate to strong “I’m not heavy enough. And you’re too heavy.”]]></description>
...
There is also a paged query parameter which lets one get the next page of results:
from urllib.request import urlopen
print('...')
print('\n'.join(urlopen('http://sneezecount.joyfeed.com/feed?paged=2').read().decode('utf8').splitlines()[20:30]))
print('...')
...
<item>
<title>Four thousand and nine</title>
<link>http://sneezecount.joyfeed.com/four-thousand-and-nine/</link>
<comments>http://sneezecount.joyfeed.com/four-thousand-and-nine/#comments</comments>
<pubDate>Sat, 09 Aug 2014 17:09:39 +0000</pubDate>
<dc:creator><![CDATA[Peter]]></dc:creator>
<category><![CDATA[Sneezes]]></category>
<guid isPermaLink="false">http://sneezecount.joyfeed.com/?p=8587</guid>
<description><![CDATA[Dining room, Malt Barn, Brize Norton Moderate Accepting the offer of an elderflower Prosecco]]></description>
...
Python also has a built-in XML parsing module, xml.etree.ElementTree. Usage is, ahem, elementary:
import xml.etree.ElementTree as ET

def parse_xml_url(url):
    # Fetch the URL and parse the response body as XML, returning the root element
    tree = ET.parse(urlopen(url))
    return tree.getroot()

root = parse_xml_url('http://sneezecount.joyfeed.com/feed?paged=2')
print(root)
<Element 'rss' at 0x7f958d5ccf70>
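Before extracting anything in bulk it is worth poking around the parsed tree. As a minimal sketch (assuming the standard RSS 2.0 layout of a single channel element containing one item child per post, as seen in the raw feed above), we can list the first few posts with find, findall and findtext:

channel = root.find('channel')
# Print the title and publication date of the first three posts
for item in channel.findall('item')[:3]:
    print(item.findtext('title'), '-', item.findtext('pubDate'))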
An RSS feed encodes each post with an item tag and each publication date within the item as a pubDate. A quick and dirty hack to extract all dates from an RSS feed is simply to list all the pubDate elements. According to WordPress’s documentation, the pubDate tags are RFC 822 dates. Python has a module to deal with that too! The email.utils module has a parsedate function, and its return value can be passed directly to time.mktime() to get a timestamp.
from email.utils import parsedate
import time

def extract_dates(root):
    # parsedate() turns an RFC 822 date string into a time 9-tuple,
    # which time.mktime() converts to seconds since the epoch
    return list(time.mktime(parsedate(elem.text)) for elem in root.iter('pubDate'))

print(', '.join(str(t) for t in extract_dates(root)))
1407600579.0, 1407520308.0, 1407389858.0, 1407381140.0, 1407344647.0, 1407255800.0, 1407141500.0, 1407069426.0, 1407009164.0, 1406970003.0
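Those floats are seconds since the Unix epoch. As a quick sanity check (a minimal sketch using the first timestamp printed above), datetime can turn one back into something readable. One caveat: parsedate ignores the +0000 timezone offset and mktime interprets the tuple as local time, so the timestamps may carry a fixed offset from UTC; for plotting trends and intervals this doesn’t matter.

from datetime import datetime

# The first timestamp above corresponds to the newest post on page 2
# ('Sat, 09 Aug 2014 17:09:39'), give or take the timezone caveat
print(datetime.fromtimestamp(1407600579.0))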
To get dates for a given feed, therefore, we simply need to page through the feed starting at page 1 until we get a result with no items.
from urllib.parse import urljoin
from urllib.error import HTTPError

def dates_from_feed_url(url):
    paged = 0
    all_dates = []
    while True:
        paged += 1
        page_url = urljoin(url, '?paged={0}'.format(paged))
        try:
            dates = extract_dates(parse_xml_url(page_url))
        except HTTPError:
            # Interpret an HTTP error (e.g. 404) as us reaching the end of the list
            break
        all_dates.extend(dates)
        if len(dates) == 0:
            break
    return all_dates
# Test with a feed URL for a single month:
ts = dates_from_feed_url('http://sneezecount.joyfeed.com/2014/08/feed')
print(', '.join(str(t) for t in ts))
1408707685.0, 1408628545.0, 1408563641.0, 1408537639.0, 1408439866.0, 1408275986.0, 1408256591.0, 1408167669.0, 1408131465.0, 1407820127.0, 1407600579.0, 1407520308.0, 1407389858.0, 1407381140.0, 1407344647.0, 1407255800.0, 1407141500.0, 1407069426.0, 1407009164.0, 1406970003.0, 1406870597.0
We can get all dates for all time by using the full feed URL:
timestamps = dates_from_feed_url('http://sneezecount.joyfeed.com/feed')
If we look at the first page of the website, as of writing there are four thousand and nineteen sneezes. Let’s just check that we’ve got them all:
print('Fetched {0} sneeze timestamps'.format(len(timestamps)))
Fetched 4019 sneeze timestamps
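Before plotting, a quick back-of-the-envelope check (a sketch; note that the feed lists posts newest first, so we sort into chronological order before taking differences):

import numpy as np

# Sort into chronological order, then compute the gaps between
# consecutive sneezes in hours
ts = np.sort(np.asarray(timestamps))
intervals_hours = np.diff(ts) / (60 * 60)
print('Mean hours between sneezes: {0:.1f}'.format(intervals_hours.mean()))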
Let’s use matplotlib to plot the timestamps. We first use some IPython magic to load the pylab environment:
%pylab inline
rcParams['figure.figsize'] = (14,9) # Set the default figure size
Populating the interactive namespace from numpy and matplotlib
# The feed lists sneezes newest first, so sort into chronological order
plot(sorted(timestamps), np.arange(len(timestamps)))
title('Sneezes over time')
xlabel('Timestamp')
ylabel('Sneezes')
grid('on')
Possibly more interesting is a histogram of intervals between sneezes:
# Use three-hour bins out to 5 days; np.abs() handles the fact that
# the raw timestamps run newest-to-oldest
hist(np.abs(np.diff(np.asarray(timestamps) / (60*60))), bins=np.arange(0, 5*24, 3))
title('Histogram of times between sneezes')
xlabel('Hours')
ylabel('Count')
grid('on')
In this post we showed how the Python standard library has all the tools we require to scrape a website, and how matplotlib can turn the scraped data into some interesting figures.