2007-02-20

Blogger index

I have been a little frustrated with the template limitations and design of Blogger. I started a Vox account and have been very impressed by what they have to offer, but customisation there is even more restricted, even though you have way more gizmos to play with. I still like Vox and will probably maintain the blog there because of its community feel.

My main beef with Blogger is that it does not currently have a way to list the titles of all your blog postings on one page. To get to old articles you have to know the publication date and dig through the archive, or use the search engine. I believe this hurts readers browsing your articles, and possibly search engines too, since the only way they can reach old articles is to follow several links down. When people ask me questions I want to find a posting really quickly. Sure, search is OK, but it feels messy and may miss, whereas an index lets me copy the link and send it on very quickly.

To meet my needs I wrote a Python program that reads the blog's main page, scans the archives and creates my own index page. I then wrapped it all in a shell script that adds a header and footer and uploads the page to the right spot.

The more I use Python, the more I find it a very powerful programming language that is easy and quick both to write and to read. For your programming pleasure I provide the code below. Feel free to copy or alter it and make suggestions, as my experience with Python is rather limited.

blogdex.py
# BLOGDEX
# Uses the latest template and requires the Blog Archive widget in flat list format
# - Simon Forsyth 20 Feb 2007, updated 11 Jul 2008

import urllib, sys, re

srcURL = "http://neoporcupine.blogspot.com"
outFileName = "blogdex.out"
logFileName = "blogdex.log"

log = open(logFileName,"w")
outFile = open(outFileName,"w")


def ExtractYM (thisURL):
    # Returns the Year and Month specified in the URL in string format 2004/08
    # http://neoporcupine.blogspot.com/2004/08/microsoft.html
    myRE = re.compile (r'(\d\d\d\d/\d\d)')
    thisMatch = myRE.search(thisURL)
    if thisMatch:
        return thisMatch.group(1)
    else:
        return "----/--"

def ProcessArchive (thisURL):
    # This is the monthly/weekly/daily page with possibly multiple blog entries
    print "Processing " + thisURL
    print "****"
    log.write ("Opening %s " % (thisURL))

    # Read in source URL
    f = urllib.urlopen(thisURL)
    MainPage = f.read()
    f.close()
    log.write ("success\n")

    # Get titles
    log.write ("Searching for titles\n")
    myRE = re.compile (r'<h3 class=\'post-title entry-title\'>(.*?)</h3>', re.DOTALL)
    myTitles = myRE.findall(MainPage)
    myRE = re.compile(r'<a href=.*?>(.*)</a>', re.DOTALL)
    # Remove any link from titles
    for i in range(0,len(myTitles)):
        myCheck = myRE.search(myTitles[i])
        if myCheck:
            myTitles[i] = myCheck.group(1).strip()
        else:
            myTitles[i] = myTitles[i].strip()
        log.write ("\t%s\n" % (myTitles[i]))

    # Get direct link to the article page by processing the link at the bottom of each article
    log.write ("Extracting URL details\n")
    myRE = re.compile (r"<span class='post-timestamp'>(.*?)</span>", re.DOTALL)
    # The DOTALL flag belongs in compile(); findall's second argument is a start position
    myURLs = myRE.findall(MainPage)

    # Extract URL info
    myRE = re.compile (r'href=\'(.*?)\'')
    for i in range(0,len(myURLs)):
        myURLs[i] = myRE.search(myURLs[i]).group(1)

    if len(myURLs) != len(myTitles):
        # Error of some sort: dump both lists to the log for comparison
        log.write ("ERROR: number of titles and urls different\n")
        for i in range(0,len(myTitles)):
            log.write ( "%d %s\n" % (i,myTitles[i]) )
        for i in range(0,len(myURLs)):
            log.write ( "%d %s\n" % (i,myURLs[i]) )
    elif len(myTitles) > 0:
        # Write date, and all title-URL to output (pages with no posts are skipped)
        log.write ("Writing output\n")
        outFile.write ( "<P><b>%s</b>\n" % (ExtractYM(myURLs[0])) )
        for i in range(0,len(myTitles)):
            outFile.write ( "<li><a href='%s'>%s</a></li>\n" % (myURLs[i],myTitles[i]) )

def processBlog(thissrcURL):
    # Read in source URL
    log.write ("**********************************************\n")
    log.write ("Opening %s " % (thissrcURL))
    f = urllib.urlopen(thissrcURL)
    MainPage = f.read()
    f.close()
    log.write ("success\n")

    # Extract archive URL section from the "Blog Archive" widget in flat format
    myRE = re.compile (r'<div id=\'BlogArchive1_ArchiveList\'>(.*?)</div>', re.DOTALL)
    myRawList = myRE.search(MainPage)
    if myRawList:
        log.write ("Found archive list in %s\n" % (thissrcURL))
        # print myRawList.group(1)
    else:
        print "Could not find archive-list in main blog"
        log.write ("Could not find archive-list in %s\n" % (thissrcURL))
        sys.exit(1)

    # Convert the archive url section into an array of URLs
    # (non-greedy, so several links on one line are not merged into one match)
    myRE = re.compile(r'href=\'(.*?)\'>')
    myArchiveURLs = myRE.findall(myRawList.group(1))

    # Process each Archive URL
    log.write('Found %i archive urls.\n' % (len(myArchiveURLs)))
    for i in range(0,len(myArchiveURLs)):
        ProcessArchive (myArchiveURLs[i])


processBlog (srcURL)
# close is a method and must be called, otherwise buffered output may not be flushed
log.close()
outFile.close()
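
If you want to try the title and link matching without hitting the network, here is a minimal sketch that runs the same regular expressions over a small inline page. It is written for Python 3, unlike the script above, and the sample markup, the example.blogspot.com URLs and the names extract_entries and year_month are my own assumptions for illustration, not anything Blogger guarantees.

```python
import re

# Hypothetical sample markup, assumed to resemble one Blogger archive page.
SAMPLE_PAGE = """
<h3 class='post-title entry-title'>
<a href='http://example.blogspot.com/2007/02/blogger-index.html'>Blogger index</a>
</h3>
<span class='post-timestamp'>at
<a class='timestamp-link' href='http://example.blogspot.com/2007/02/blogger-index.html'>10:15</a>
</span>
<h3 class='post-title entry-title'>
Plain title
</h3>
<span class='post-timestamp'>at
<a class='timestamp-link' href='http://example.blogspot.com/2007/02/plain-title.html'>09:00</a>
</span>
"""

# Same patterns as blogdex.py, compiled once with their flags.
TITLE_RE = re.compile(r"<h3 class='post-title entry-title'>(.*?)</h3>", re.DOTALL)
LINK_TEXT_RE = re.compile(r"<a href=.*?>(.*)</a>", re.DOTALL)
STAMP_RE = re.compile(r"<span class='post-timestamp'>(.*?)</span>", re.DOTALL)
HREF_RE = re.compile(r"href='(.*?)'")
YEAR_MONTH_RE = re.compile(r"(\d\d\d\d/\d\d)")


def extract_entries(page):
    """Return a list of (title, url) pairs found in one archive page."""
    titles = []
    for raw in TITLE_RE.findall(page):
        linked = LINK_TEXT_RE.search(raw)
        # Keep only the link text when the title is wrapped in a link.
        titles.append((linked.group(1) if linked else raw).strip())

    urls = []
    for raw in STAMP_RE.findall(page):
        match = HREF_RE.search(raw)
        if match:
            urls.append(match.group(1))

    if len(titles) != len(urls):
        raise ValueError("found %d titles but %d urls" % (len(titles), len(urls)))
    return list(zip(titles, urls))


def year_month(url):
    """Return the yyyy/mm part of a post URL, or a placeholder if absent."""
    match = YEAR_MONTH_RE.search(url)
    return match.group(1) if match else "----/--"


entries = extract_entries(SAMPLE_PAGE)
for title, url in entries:
    print(year_month(url), title, url)
```

Pairing titles with URLs by position only works while every post has both a title and a timestamp link, which is why the sketch raises an error when the counts differ rather than writing a misaligned index.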


Then a nice little shell script (works with OS X) to scoop it all together and ftp it off to your favourite website. This assumes that you have already created the header.html and footer.html that surround the blogdex.out output.
python blogdex.py
cat header.html blogdex.out footer.html > blogdex.html
# Log in to ftp; commands are read up to END_SCRIPT and the session output goes to cp2web.log
ftp -n my.server.net > cp2web.log << END_SCRIPT
user myusername mypassword
passive
put ~/myScripts/blogdex/blogdex.html /blogdex.html
quit
END_SCRIPT
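
If you would rather not depend on the shell, the header/body/footer step could be done in Python as well. This is only a sketch under my own assumptions: assemble_page is a made-up helper name, and the demonstration writes throwaway files to a temporary directory instead of using the real header.html, blogdex.out and footer.html.

```python
import os
import tempfile


def assemble_page(header_path, body_path, footer_path, out_path):
    """Concatenate header, body and footer into out_path, in that order."""
    with open(out_path, "w") as out:
        for part in (header_path, body_path, footer_path):
            with open(part) as src:
                out.write(src.read())


# Demonstration with throwaway files in a temporary directory.
workdir = tempfile.mkdtemp()
paths = {}
for name, text in (("header.html", "<html><body>\n"),
                   ("blogdex.out", "<li>entry</li>\n"),
                   ("footer.html", "</body></html>\n")):
    paths[name] = os.path.join(workdir, name)
    with open(paths[name], "w") as f:
        f.write(text)

result_path = os.path.join(workdir, "blogdex.html")
assemble_page(paths["header.html"], paths["blogdex.out"],
              paths["footer.html"], result_path)

with open(result_path) as f:
    page = f.read()
print(page)
```

The upload could probably move into the same script using the standard library's ftplib module (FTP.login and FTP.storbinary), though I have not shown that here.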
