note to self....

http://blogs.earthside.org/note_to_self/

Monday, October 31, 2005

azlyricfilter.pl - strip HTML from azlyrics.com files

#!/usr/bin/perl -w
#
# azlyricfilter.pl - filters HTML from www.azlyrics.com into text
#
# Run this on the output of 'html2text -nobs azlyricsURL', e.g.
#
# html2text -nobs http://www.azlyrics.com/lyrics/u2/exit.html \
# | azlyricfilter.pl > u2-exit.txt
#
# The -nobs option keeps html2text from marking up the text with
# backspace and overstrike characters.
#
# This program requires the html2text program.
#
# Revision History:
# 2005-10-31: pdwilso@gmail.com - initial version
#
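use strict;

my $in_header = 1;    # true until the quoted song title is seen
while (<>) {
    chomp;
    # While still in the page header, drop the "LYRICS" banner and blank lines.
    next if $in_header && /LYRICS/;
    next if $in_header && /^\s*$/;
    # The song title is the first line that starts with a double quote.
    $in_header = 0 if /^\s*"/;
    # Stop at the site footer; the dots are escaped so they match literally.
    last if /\[ www\.azlyrics\.com \]/;
    s/^\s+//;         # strip leading indentation
    print "$_\n";
}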
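
To see what the filter does without fetching a page, here is a rough sketch that wraps the same rules in a subroutine and runs them on a made-up sample. The name filter_lyrics and the sample lines are assumptions for illustration only; real html2text output from azlyrics.com may be laid out differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same filtering rules as azlyricfilter.pl, wrapped in a subroutine so they
# can be tried on in-memory lines. filter_lyrics is a hypothetical helper
# name, not part of the original script.
sub filter_lyrics {
    my @out;
    my $in_header = 1;                  # true until the quoted title is seen
    for my $line (@_) {
        (my $text = $line) =~ s/\n\z//; # copy the line and drop its newline
        next if $in_header && $text =~ /LYRICS/;
        next if $in_header && $text =~ /^\s*$/;
        $in_header = 0 if $text =~ /^\s*"/;
        last if $text =~ /\[ www\.azlyrics\.com \]/;
        $text =~ s/^\s+//;              # strip leading indentation
        push @out, $text;
    }
    return @out;
}

# Made-up input that only imitates the general shape of html2text output.
my @sample = (
    "SOME ARTIST LYRICS\n",
    "\n",
    "   \"Some Song\"\n",
    "\n",
    "   First line of the song\n",
    "   Second line of the song\n",
    "\n",
    "[ www.azlyrics.com ]\n",
    "Footer text\n",
);

print "$_\n" for filter_lyrics(@sample);
```

The banner and the blank line before the title are dropped, blank lines after the title are kept so stanza breaks survive, and nothing from the footer marker onward is printed.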




