Thursday, December 21, 2006

Dictionary @ Python

> Python dict is a hash table, isn't it? Yup.
> I know that hashtable has the concept of "bucket size" and "min bucket
> count" stuff,

Some implementations of hash tables do. Python's does not; it uses what's called "open addressing" instead.
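To make the distinction concrete, here is a minimal sketch of open addressing with linear probing. This is an illustration only, not CPython's actual code: the real implementation uses a more elaborate probe sequence and resizes itself, neither of which is shown here.

```python
# Sketch of open addressing: one flat array of slots, no buckets.
# On a collision, we step to the next slot instead of chaining.
class OpenAddressingDict:
    EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, size=8):
        self.slots = [self.EMPTY] * size

    def _probe(self, key):
        # Start at the hashed slot; on collision, try the next one.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not self.EMPTY and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def get(self, key):
        entry = self.slots[self._probe(key)]
        if entry is self.EMPTY:
            raise KeyError(key)
        return entry[1]
```

The point of the sketch is simply that there is no "bucket" anywhere: every entry lives directly in the one table, so "bucket size" and "min bucket count" have nothing to attach to.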

> and they should be configurable so I can set them to the proper value
> when I know how many items I'm going to handle.
> If these two values can't be set, the hashtable will give them default
> values. When there are more and more items being added to the
> hashtable, it increases its buckets and copies the old items into the new
> one and re-calculate the hash value of each item.

That's several distinct claims, each of which is true of some hash table implementations but not of others. Python's implementation has no "buckets", so all that's simply irrelevant. Python's implementation (not all do) saves the hash value of each item, so when it does need to grow (or shrink) the space allocated to the table, it does not need to recalculate the hash value of any item.
You should also note that copying a dict key or value (no matter what type) consists in its entirety of copying one machine address (a 4- or 8-byte pointer, depending on platform).
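The saved-hash claim can be observed from pure Python. In the demo below (a sketch written for CPython), a key object counts how often its __hash__ is called; after it is inserted, thousands of further insertions force the table to grow several times, yet the count never goes above one, because resizing reuses the stored hash value:

```python
# Demonstrate that growing a dict does not re-hash existing keys
# (true of CPython, which stores each entry's hash in the table).
class CountingKey:
    def __init__(self):
        self.hash_calls = 0

    def __hash__(self):
        self.hash_calls += 1
        return 42  # arbitrary fixed hash, counted on each call

    def __eq__(self, other):
        return self is other

k = CountingKey()
d = {k: None}            # __hash__ called once, on insertion
for i in range(20000):   # force several internal resizes
    d[i] = None
print(k.hash_calls)      # still 1: resizing reused the saved hash
```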

> I think this will waste some time doing the increasing-copying thing.

It does take time to resize. "Waste" is a value judgment ;-)

> If I know I'm going to handle about 20000 items, I can set the size of
> hashtable to 30000.
> So, can I do this in python?

You can't. Unless you build up to 20000 items over and over (many thousands of times), it's unlikely you'd be able to measure the time difference even if you could.

Ex:

1) Creating the dictionary with no other Python overhead:

import random
r = [random.random() for i in range(20000)]
d = dict.fromkeys(r)

2) Creating the dictionary using a Python for loop:

import random
r = [random.random() for i in range(20000)]
d = {}
for k in r:
    d[k] = None
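If you want to check the "you couldn't measure it" claim yourself, the standard timeit module can compare the two approaches. The numbers below will vary by machine and Python version; the shape of the comparison is the point, not any particular figure:

```python
# Time two ways of building a 20000-key dict with the timeit module.
import random
import timeit

r = [random.random() for i in range(20000)]

def fromkeys_build():
    # One C-level call builds the whole dict.
    return dict.fromkeys(r)

def loop_build():
    # A Python-level loop inserts keys one at a time.
    d = {}
    for k in r:
        d[k] = None
    return d

t1 = timeit.timeit(fromkeys_build, number=20)
t2 = timeit.timeit(loop_build, number=20)
print("dict.fromkeys: %.3fs   for loop: %.3fs" % (t1, t2))
```

Either way, the resizes that happen along the way are a small fraction of the total cost of building the dict.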
