<?xml version="1.0"?>
<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:dcterms="http://purl.org/dc/terms/" >
<channel>
<title>Python - Justin&#x27;s Ramblings</title>
<link>http://bouncybouncy.net//ramblings/tags/python/</link>
<description>BB.Net</description>
<item>
	
	<title>Python Evolution: From Script To Program</title>
	
	<guid>http://bouncybouncy.net//ramblings/posts/python_evolution_from_script_to_program/</guid>
	<link>http://bouncybouncy.net//ramblings/posts/python_evolution_from_script_to_program/</link>
	
	
	<category>tags/python</category>
	
	<category>tags/tech</category>
	
	
	<pubDate>Sat, 21 Jun 2008 23:18:12 -0400</pubDate>
	<dcterms:modified>2008-06-22T15:12:26Z</dcterms:modified>
	
	<description><![CDATA[<p><a href=
"http://forums.thedailywtf.com/forums/p/6978/132159.aspx">The
Evolution of a Python Programmer</a> is funny, but it only covers
one aspect of programming. Many times I will see code that is fine
from a CS point of view, but absolutely horrible when it comes to
program structure and module organization.</p>
<p>You often see people saying things like "Hello World in python
is just 'print "Hello World"'", and that is true. It is very easy
to get started writing python, but if you don't structure your
modules correctly, you will be in a world of pain later on. It is
something that can be hard to explain, since the results in the
short term are the same, and it may not be clear at first why one
way of doing things is better than the other.</p>
<p>Instead of Hello World, let's take the example of a program to
get stock quotes. The actual implementation here is not relevant;
pretend it contacts a web service or database or something.</p>
<p>A common case is the "python script". I HATE python scripts. A
"script" almost always ends up being a single file with no entry
points and no main function, and it mixes IO with logic.</p>
<div class="syntax">
<pre>
s = raw_input("<span class="synConstant">symbol:</span>")
<span class="synStatement">if</span> s == '<span class=
"synConstant">MSFT</span>':
    <span class="synStatement">print</span> '<span class=
"synConstant">price=</span>', 28.23
<span class="synStatement">elif</span> s == '<span class=
"synConstant">GOOG</span>':
    <span class="synStatement">print</span> '<span class=
"synConstant">price=</span>', 546.43

</pre></div>
<p>The first step in fixing this is to define an actual function.
Now you can import the module and run get_price().</p>
<div class="syntax">
<pre>
<span class="synStatement">def</span> <span class=
"synIdentifier">get_price</span>():
    s = raw_input("<span class="synConstant">symbol:</span>")
    <span class="synStatement">if</span> s == '<span class=
"synConstant">MSFT</span>':
        <span class="synStatement">print</span> '<span class=
"synConstant">price=</span>', 28.23
    <span class="synStatement">elif</span> s == '<span class=
"synConstant">GOOG</span>':
        <span class="synStatement">print</span> '<span class=
"synConstant">price=</span>', 546.43

</pre></div>
<p>The (hopefully) obvious problem with this is that the IO is
mixed in with the logic. What if you wanted to get the stock price
for 1000 stocks and output a nice summary? This next version is
slightly better: the input is now a proper parameter, but you
still have no control over the output. You could get your 1000
quotes, but the prices are printed rather than returned, so you
would have no way to build a report from them. Again,
this should be obvious, but I come across code that does this way
too often.</p>
<div class="syntax">
<pre>
<span class="synStatement">def</span> <span class=
"synIdentifier">get_price</span>(s):
    <span class="synStatement">if</span> s == '<span class=
"synConstant">MSFT</span>':
        <span class="synStatement">print</span> '<span class=
"synConstant">price=</span>', 28.23
    <span class="synStatement">elif</span> s == '<span class=
"synConstant">GOOG</span>':
        <span class="synStatement">print</span> '<span class=
"synConstant">price=</span>', 546.43
<span class="synComment">###</span>
<span class="synStatement">if</span> __name__ == "<span class=
"synConstant">__main__</span>":
    s = raw_input("<span class="synConstant">symbol:</span>")
    get_price(s)

</pre></div>
<p>The first respectable version has get_price() return the price
instead of printing it, and adds a main() function that
handles the input and output. The main function should also get the
stock from the command line arguments, rather than interactively
(the version below still prompts with raw_input). I
think you tend to see interactive prompts more often from Windows
users, who like to double click on things rather than run them from
a shell. You could probably write a whole book on this subject
though <img src="http://bouncybouncy.net//ramblings/tags/python/../../../smileys/smile.png" alt=":-)" /></p>
<div class="syntax">
<pre>
<span class="synStatement">def</span> <span class=
"synIdentifier">get_price</span>(s):
    <span class="synStatement">if</span> s == '<span class=
"synConstant">MSFT</span>':
        <span class="synStatement">return</span> 28.23
    <span class="synStatement">elif</span> s == '<span class=
"synConstant">GOOG</span>':
        <span class="synStatement">return</span> 546.43
<span class="synComment">###</span>
<span class="synStatement">def</span> <span class=
"synIdentifier">main</span>():
    s = raw_input("<span class="synConstant">symbol:</span>")
    <span class="synStatement">print</span> '<span class=
"synConstant">price=</span>', get_price(s)

<span class="synStatement">if</span> __name__ == "<span class=
"synConstant">__main__</span>":
    main()

</pre></div>
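<p>As a rough sketch of the command line variant mentioned above,
main() could read the symbols from sys.argv instead of prompting.
The PRICES table and the file name quotes.py are made up for the
example; they just stand in for whatever really fetches the quotes.</p>

```python
import sys

# Hypothetical price table standing in for a real quote source.
PRICES = {'MSFT': 28.23, 'GOOG': 546.43}

def get_price(s):
    # Pure logic: return the price, or None for an unknown symbol.
    return PRICES.get(s)

def main(argv=None):
    # All IO lives here. Symbols come from the command line,
    # e.g. "python quotes.py MSFT GOOG".
    if argv is None:
        argv = sys.argv[1:]
    for s in argv:
        print('%s price=%s' % (s, get_price(s)))

if __name__ == "__main__":
    main()
```

<p>Because get_price() only returns a value, another module can
import it and build the 1000-stock summary without touching
main().</p>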
<p>The final steps are to make a proper python package out of this
module, but I'll save that for a later post.</p>

]]></description>
	
</item>
<item>
	
	<title>how my dupe finding program works</title>
	
	<guid>http://bouncybouncy.net//ramblings/posts/how_my_dupe_finding_program_works/</guid>
	<link>http://bouncybouncy.net//ramblings/posts/how_my_dupe_finding_program_works/</link>
	
	
	<category>tags/python</category>
	
	<category>tags/tech</category>
	
	
	<pubDate>Thu, 21 Feb 2008 23:41:03 -0500</pubDate>
	<dcterms:modified>2008-02-22T04:59:18Z</dcterms:modified>
	
	<description><![CDATA[<h2>finding duplicate files</h2>
<p>This post is about my duplicate finding program available under
<a href="http://bouncybouncy.net//ramblings/tags/python/../../../programs/">Programs</a>. The program is a little
bare, and needs a nicer API, but the method it uses is the most
efficient one that I am aware of.</p>
<p>There are a couple of different ways you can find duplicate
files:</p>
<h3>Compute the hash of all the files, and look for duplicates</h3>
<p>This method works well if the files on disk are mostly static,
and files are added infrequently. In this case you can compute the
hashes once, and keep them around for later scans. However, if you
are only running the scan once, this method is not ideal since it
requires you to read the full contents of every file.</p>
<h3>Compute the hash of files with the same size</h3>
<p>This is the method that I think fdupes still uses. It first
builds a candidate list of files that are the same size, and
computes the checksum of each. This method works well if most of
the files that are the same size are really duplicates, but
otherwise triggers too much unneeded IO.</p>
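<p>Building the candidate list is the cheap part, since it only
needs the file sizes. A minimal sketch, assuming you already have a
list of paths (size_candidates is a name I made up here, it is not
taken from fdupes or from my program):</p>

```python
import os
from collections import defaultdict

def size_candidates(paths):
    # Map each file size to the list of paths that have that size.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    # A file with a unique size cannot have a duplicate, so drop it.
    return [group for group in by_size.values() if len(group) != 1]
```

<p>Only the groups returned here need any file contents read at
all.</p>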
<h3>Compare all files with the same size in parallel</h3>
<p>This is the method that my program uses. Like fdupes, it first
builds up a candidate list of files with the same size. Instead of
hashing the files, it simply reads all of the files at the same time,
comparing block by block. This is just like what the
<em>cmp(1)</em> program does, but for multiple files at the same
time. The benefit of this over calculating each file's hash is that
as soon as the files differ, you can stop reading.</p>
<h2>Implementation</h2>
<p>There are a couple of things you need to keep in mind to
implement this method.</p>
<h3>Don't open too many files.</h3>
<p>You have to be careful not to try to open too many files at
once. If the user has 5,000 files that all have the same size, the
program shouldn't try to open all 5,000 at once. My program uses a
simple helper class to handle opening and closing files. The
default blocksize in my program would probably waste a bit of
memory in this case, but that is easily changed.</p>
<h3>Correctly handle diverging sets.</h3>
<p>Imagine the filesystem contains 4 files of the same size, 'a',
'b', 'c', and 'd', where a==c, and b==d. While reading through the
files, it will become clear that a!=b, a==c, and a!=d. It is
important that at this step the program continues searching using
(a,c) and (b,d) as possible duplicates. This is implemented using
recursion: the sets (a,c) and (b,d) are fed back into the duplicate
finding function.</p>
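<p>Here is a simplified sketch of that recursion. It is not the
actual code from dupes.py: find_dupes and its parameters are names
invented for this example, it assumes every path in the set has the
same size, and it opens the whole set at once, so it leaves out the
file handle limiting described above.</p>

```python
from collections import defaultdict

def find_dupes(paths, offset=0, blocksize=65536):
    # paths: files of equal size that are identical up to `offset`.
    # Returns a list of duplicate sets (each a list of paths).
    if len(paths) in (0, 1):
        return []
    handles = [open(p, 'rb') for p in paths]
    try:
        for f in handles:
            f.seek(offset)
        while True:
            # Read one block from every file, bucket paths by content.
            buckets = defaultdict(list)
            for p, f in zip(paths, handles):
                buckets[f.read(blocksize)].append(p)
            if len(buckets) != 1:
                break  # the set diverged at this block: stop reading
            if b'' in buckets:
                return [paths]  # all hit EOF together: all identical
            offset += blocksize
    finally:
        for f in handles:
            f.close()
    # Feed each diverged subset back in, resuming at the same offset.
    result = []
    for subset in buckets.values():
        result.extend(find_dupes(subset, offset, blocksize))
    return result
```

<p>With the four files from the example, the first call splits the
set into (a,c) and (b,d) at the first differing block, and each pair
is then compared on its own until end of file.</p>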
<h2>Example run, compared to fdupes.</h2>
<p>Here is dupes.py running against fdupes on a modestly sized
directory. Notice how dupes.py only needs to read about 600K (not counting
metadata).</p>
<p>According to iofileb.d from the dtrace toolkit, dupes.py reads
10M of data (which I think includes python), and fdupes reads 517M.
This alone explains the roughly 20x speedup seen in dupes.py.</p>
<div class="syntax">
<pre>
justin@pip:~$ du -hs $DIR
15G   $DIR

justin@pip:~$ time python code/dupes.py $DIR
2896 total files
35 size collisions, max of length 5
bytes read 647168

real    0m1.224s
user    0m0.234s
sys     0m0.494s

justin@pip:~$ time fdupes -r $DIR
real    0m41.694s
user    0m13.612s
sys     0m7.491s

justin@pip:~$ time python code/dupes.py $DIR
2896 total files
35 size collisions, max of length 5
bytes read 647168

real    0m3.662s
user    0m0.256s
sys     0m0.568s

justin@pip:~$ time fdupes -r $DIR
real    0m55.473s
user    0m11.383s
sys     0m6.433s

</pre></div>

]]></description>
	
</item>
<item>
	
	<title>regex with named groups</title>
	
	<guid>http://bouncybouncy.net//ramblings/posts/regex_with_named_groups/</guid>
	<link>http://bouncybouncy.net//ramblings/posts/regex_with_named_groups/</link>
	
	
	<category>tags/python</category>
	
	<category>tags/tech</category>
	
	
	<pubDate>Wed, 20 Feb 2008 11:42:21 -0500</pubDate>
	<dcterms:modified>2008-02-20T17:00:38Z</dcterms:modified>
	
	<description><![CDATA[<p>As I mentioned in a comment at <a href=
"http://handyfloss.wordpress.com/2008/02/19/some-more-tweaks-to-my-python-script/">
Some more tweaks to my Python script</a>, there are a lot of ways
you can use the re module. If you need to match multiple
expressions against each line, you can build up a single regular
expression that includes all the patterns, and use named groups to
tell them apart.</p>
<div class="syntax">
<pre>

<span class="synPreProc">import</span> re
<span class=
"synComment">#if you were matching many of these it would be a good idea</span>
<span class=
"synComment">#to make a function that simply fills in '%s&gt;(?P&lt;%s&gt;[^&lt;]+)&lt;'</span>
cpattern    = '<span class=
"synConstant">total_credit&gt;(?P&lt;credit&gt;[^&lt;]+)&lt;</span>'
opattern    = '<span class=
"synConstant">os_name&gt;(?P&lt;os&gt;[^&lt;]+)&lt;</span>'
pattern     = '<span class=
"synConstant">(%s)|(%s)</span>' % (cpattern, opattern)

search = re.compile(pattern).search

lines = [
    '<span class=
"synConstant">blah blah blah total_credit&gt;10&lt; blah blah</span>',
    '<span class=
"synConstant">hkfhsd klfjhs dfkljsdfsl fds</span>',
    '<span class=
"synConstant">hkashflksd os_name&gt;win&lt; hhkjhdflksj d</span>',
    '<span class=
"synConstant">hkfhsd klfjhs dfkljsdfsl fds</span>',
    '<span class=
"synConstant">blah blah blah total_credit&gt;20&lt; blah blah</span>',
]

<span class="synStatement">for</span> line <span class=
"synStatement">in</span> lines:
    r = search(line)
    <span class="synStatement">if</span> r:
        <span class="synStatement">print</span> r.groupdict()

</pre></div>
<p>Running this gives</p>
<div class="syntax">
<pre>
{'<span class="synConstant">credit</span>': '<span class=
"synConstant">10</span>', '<span class=
"synConstant">os</span>': None}
{'<span class="synConstant">credit</span>': None, '<span class=
"synConstant">os</span>': '<span class="synConstant">win</span>'}
{'<span class="synConstant">credit</span>': '<span class=
"synConstant">20</span>', '<span class=
"synConstant">os</span>': None}

</pre></div>
<p>In this case you could even generalize the regular expression
further, like so:</p>
<div class="syntax">
<pre>
pattern     = '<span class=
"synConstant">\s(?P&lt;key&gt;[^\s&gt;]+)&gt;(?P&lt;value&gt;[^&lt;]+)&lt;</span>'

</pre></div>
<p>Running that (probably less than optimal) regular expression
over the input gives</p>
<div class="syntax">
<pre>
{'<span class="synConstant">key</span>': '<span class=
"synConstant">total_credit</span>', '<span class=
"synConstant">value</span>': '<span class="synConstant">10</span>'}
{'<span class="synConstant">key</span>': '<span class=
"synConstant">os_name</span>', '<span class=
"synConstant">value</span>': '<span class=
"synConstant">win</span>'}
{'<span class="synConstant">key</span>': '<span class=
"synConstant">total_credit</span>', '<span class=
"synConstant">value</span>': '<span class="synConstant">20</span>'}

</pre></div>

]]></description>
	
</item>
<item>
	
	<title>dynamic ikiwiki pages</title>
	
	<guid>http://bouncybouncy.net//ramblings/posts/dynamic_ikiwiki_pages/</guid>
	<link>http://bouncybouncy.net//ramblings/posts/dynamic_ikiwiki_pages/</link>
	
	
	<category>tags/ikiwiki</category>
	
	<category>tags/meta</category>
	
	<category>tags/pylons</category>
	
	<category>tags/python</category>
	
	<category>tags/tech</category>
	
	
	<pubDate>Fri, 15 Feb 2008 20:57:58 -0500</pubDate>
	<dcterms:modified>2008-02-16T18:36:26Z</dcterms:modified>
	
	<description><![CDATA[<p>The static pages that <a href="http://ikiwiki.info">ikiwiki</a>
generates are great, but I want to have some dynamic content here
as well.</p>
<p>If this works, this page should include the server's uptime.</p>
<!--# include virtual="/dyn/demo/uptime" -->
<p>yay <img src="http://bouncybouncy.net//ramblings/tags/python/../../../smileys/smile.png" alt=":-)" /></p>
<p>So how does that work?</p>
<p>First, configure nginx as follows:</p>
<div class="syntax">
<pre>
server {
    listen       80;
    server_name  bouncybouncy.net  *.bouncybouncy.net web;

    location / {
        root   /home/justin/bbdotnet/static/;
        index  index.html index.htm;
        ssi on;
    }
    location /dyn {
        # All POST requests go to pylons directly
        include /usr/local/nginx/conf/proxy.conf;
        proxy_redirect  default; 
        if ($request_method = POST) {
            proxy_pass  http://127.0.0.1:5000;
            break;
        }
        default_type text/html; 

        set $memcached_key "$uri";
        memcached_pass localhost:11211;

        proxy_intercept_errors  on;

        # If nothing is found in memcached, or memcached is down, go to the real dynamic location
        error_page 404 502 = @dynamic_request;
    }
    location @dynamic_request{
        # This means, that we can't get to this location from outside - only by internal redirect
        internal;

        include /usr/local/nginx/conf/proxy.conf;
        proxy_redirect  default; 
        proxy_pass  http://127.0.0.1:5000;
    }

}

</pre></div>
<p>Pylons is set up to run on port 5000 as usual; nothing fancy
there.</p>
<p>Then anywhere we want some dynamic content we can simply do</p>
<div class="syntax">
<pre>
&lt;!--# include virtual="/dyn/demo/uptime" --&gt;

</pre></div>
<p>For now, you have to disable the htmlscrubber plugin for this to
work. There is probably a better solution. I think this would
simply involve a plugin that could run after htmlscrubber to insert
the include, then you would only need to have something like
[[include virtual="/dyn/demo/uptime"]] in your pages.</p>
<p>If you did not mind requiring JavaScript, you could use <a href=
"http://www.mnot.net/javascript/hinclude/">HInclude</a> instead of
SSI.</p>
<p>To keep things running fast, we enable caching on the pylons
controller using a modified version of the beaker_cache
decorator. The following lines are inserted at the end of the
create_func method, which causes the page result to be cached
in memcache as well as in beaker.</p>
<div class="syntax">
<pre>
url = pylons.request.path_info
<span class="synStatement">if</span> pylons.request.params:
    url += "<span class=
"synConstant">?</span>" + pylons.request.environ['<span class=
"synConstant">QUERY_STRING</span>']

mc = memcache.Client(['<span class=
"synConstant">localhost</span>'])
mc.set(url, result, cache_expire)

</pre></div>
<p>The only remaining problem I see is a small race condition. If
the cache expires, and 20 concurrent requests all come in for the
page, most of them will end up hitting python instead of waiting
for the memcache key to appear. This might actually work better
using varnish or apache2 with <code>mod_disk_cache</code>, but the
last time I tried I could not get varnish to work at all, and
apache2 (I think) still does not support PURGE.</p>
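<p>One common way to soften that race, which is not something this
setup does, is to let only the first request rebuild the page while
the others wait for the key. The sketch below assumes a
memcache.Client-like object with get, add, set and delete, where
add only stores a key that does not exist yet. The name
cached_or_build, the build callback, the ":lock" suffix and all of
the timings are made up for the example.</p>

```python
import time

def cached_or_build(mc, key, build, expire=60, lock_ttl=10,
                    wait=0.05, tries=100):
    # mc: memcache.Client-like object; build: callable rendering the page.
    for _ in range(tries):
        value = mc.get(key)
        if value is not None:
            return value
        # add() succeeds only for the first caller, so it acts as a lock.
        if mc.add(key + ':lock', 1, lock_ttl):
            try:
                value = build()
                mc.set(key, value, expire)
                return value
            finally:
                mc.delete(key + ':lock')
        # Someone else is rebuilding: wait for the key to appear.
        time.sleep(wait)
    # Gave up waiting: fall back to rendering the page ourselves.
    return build()
```

<p>The requests still reach python, but only one of them does the
expensive rendering; the lock expires on its own after lock_ttl
seconds if the rebuilding process dies.</p>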

]]></description>
	
</item>

</channel>
</rss>
