Why HashMaps in Java are so cool
A study tour of HashMap internals, the small engineering decisions that keep it fast in the bad cases, and the contract on the user side that quietly breaks it.
For most of my time writing Java, HashMap was just put and get, two methods I trusted to be “basically O(1)” the way I trust gravity to be basically nine point eight. Then one day I had to override equals and hashCode on a class I was using as a key, and the documentation kept saying things like “must agree with equals” and “good distribution is important” without explaining what would actually happen if I got it wrong. So I opened the OpenJDK source, mostly to confirm that I was being a coward for not having done it sooner.
I was. The thing I came away with: HashMap is cool because every part of it is engineering for the case where someone, sometimes me, sometimes an adversary, sometimes me in a particularly creative mood, hands it data that makes the naive version fall over. The clever bits are not the algorithm, they are the small decisions around the algorithm that keep it fast in the bad cases.
This is a tour of those decisions, from the perspective of someone who is still learning. I am leaning on the source and on JEP 180 more than on a decade of war stories I do not have.
The basic idea, briefly
A hash map stores key-value pairs in an internal array. To put(k, v), you compute hash(k), turn that hash into an index into the array, and store the entry there. To get(k), you do the same and read.
The catch is collisions: two different keys can land at the same index. The classic answer is that each array slot (“bucket”) points at a linked list of entries, and you walk that list comparing with equals until you find your key. If everything goes well, you walk one entry. If everything goes badly, you walk all of them, and your “basically O(1)” lookup turns into a linear scan with the lights off.
If the hash function spreads keys evenly and the bucket is not full, average lookup is O(1). If it does not, you fall back to walking the list, which is O(n). That O(n) cliff is where all the interesting engineering lives, which is to say, where the JDK authors had the most fun.
Cool thing #1: how Java picks the bucket
Java’s HashMap uses an internal array whose capacity is always a power of two. The default is 16, and resizes double it. This is not because powers of two are pretty (although they are), it is so the “which bucket” calculation can be a bitwise AND instead of a modulo:
int index = (capacity - 1) & hash;
capacity - 1 is a mask of low bits (15 is 0b1111, 31 is 0b11111, and so on). The AND keeps only those low bits of the hash. This is significantly faster than hash % capacity, and put/get is the hottest path in roughly every Java program written, including the ones that are mostly logging.
There is a catch. If you only keep the low bits, the high bits of the hash are thrown away. If two objects happen to differ only in their high bits, they will collide every time, even though their hashCode() values are different. The author of that hashCode did the work in good faith, and the bucket calculation just dropped it on the floor. So Java’s HashMap does this small, almost embarrassed thing:
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
It takes the hash code, shifts it right by 16, and XORs the result back in. That folds the high 16 bits down into the low 16 bits before the bucket mask is applied. So even if a user’s hashCode() puts all its entropy in the high bits, some of that entropy survives the mask and ends up affecting which bucket the entry lands in.
The first time I read this I thought it was a hack. Now I think it is the right kind of hack: cheap, defensive, and quietly assumes the worst about the data it is given. Like a good bouncer.
Cool thing #2: when the bucket gets long, it grows up
Before Java 8, when a single bucket’s linked list got long, because of collisions, a bad hashCode, or someone with too much free time submitting adversarial input, get on that bucket became a linear scan. In the absolute worst case, every key collided and HashMap degenerated to a glorified linked list, which is exactly the data structure people use HashMap to not be.
This is not just an academic worry. It is the basis for hash flooding denial-of-service attacks, where an attacker submits inputs designed to all collide. Your nice O(1) endpoint becomes an O(n) endpoint for that one request, and n is “everything they sent.” It went badly for some real frameworks.
JEP 180 changed this. In Java 8 and later, when a bucket’s list grows beyond a threshold (TREEIFY_THRESHOLD = 8) and the table is large enough (MIN_TREEIFY_CAPACITY = 64), HashMap converts that bucket from a linked list into a red-black tree. Lookups inside a tree-ified bucket are O(log n) instead of O(n). The worst case is still bad, but it is now the kind of bad you can quietly live with.
The bit I had to read twice to believe: the tree needs an ordering, but Object does not have a natural one. Java handles this in two stages. If the keys implement Comparable, it uses that. If not, it falls back to comparing System.identityHashCode, which is at least a total ordering, even if it is total in the sense of “a coin toss, but deterministic.” The tree might not be in a meaningful order, but it is a valid binary search tree, which is all the operations actually need.
That, to me, is the genuinely cool part. The data structure does not assume the keys are comparable, and still salvages O(log n) worst case from a situation where the original linked-list HashMap would have face-planted. They could have said “your hash function is bad, you get what you deserve.” They engineered around the user instead, which is most of what good library design is.
Cool thing #3: resize is sneakier than it looks
HashMap has a loadFactor, defaulting to 0.75, and a threshold equal to capacity * loadFactor. When size exceeds threshold, the table doubles and every entry has to be placed in its new bucket.
The 0.75 number is not a vibe. It is a trade-off between space and time: too low and you waste memory on empty buckets, too high and collisions become common and lookups slow down. 0.75 is empirically near the sweet spot, and the Javadoc says so, which is unusual and honest of it.
The clever bit is what happens during the resize. Because the capacity is always a power of two and doubles, when you rehash an entry, its new bucket index is either the same as before, or exactly oldCapacity higher. You can tell which one without recomputing the hash, just by looking at the one new bit that just got added to the mask:
if ((entry.hash & oldCapacity) == 0) {
// stays in the same bucket index
} else {
// moves to index + oldCapacity
}
So resize splits each old bucket into at most two new ones, in a single pass, without calling hashCode again. It is the kind of optimisation that, when you spot it in a code review, you make a small “huh” noise and then never forget. Resizes are not free, and this makes them cheaper than the naive “recompute every hash” approach for no obvious cost.
Where it falls over
For all the engineering, the contract HashMap depends on still lives on the user side. Get any of the following wrong and the structure quietly stops doing what you think it does, which is somehow worse than failing loudly.
- Break the
equals/hashCodecontract and you get keys you cannot find. The rule is: ifa.equals(b), thena.hashCode() == b.hashCode(). The reverse does not have to hold (two unequal objects can share a hash code, that is just a collision and is fine), but the forward direction is non-negotiable. Violate it andput(a, v)followed byget(b)returns null, even thougha.equals(b)is true, and you will spend an afternoon wondering if the JVM is broken. It is not. You are. - Mutate a key after putting it. If the key’s
hashCode()changes after it has been placed in the map, the map can no longer find it. The entry is still there, sitting in its old bucket, but lookup goes somewhere else now. The entry has not been deleted, it has been hidden. Effective Java spends a whole item on this for a reason, and the reason is that everyone tries it once. - Use a poor
hashCode. AhashCodethat always returns0, or1, or42because you find it funny, technically satisfies the contract. It also turns every operation into a tree walk and makes yourHashMapperform like a stack of envelopes. The map still works. It is just slow in a way that does not show up until production data shows up. - Use it from multiple threads.
HashMapis not thread-safe. Concurrent puts during a resize have, historically, been able to corrupt the internal structure into an infinite loop inget, which is the kind of thing you find out at 2am. UseConcurrentHashMapif you need concurrency, do not just wrap aHashMapinsynchronizedand assume nobody else will.
What I actually took away
The thing I keep coming back to is that the data structure is mostly engineering for the bad case. The average case is obvious: hash, mask, look up, done. The hard part is what to do when the hash is bad, when the input is adversarial, when the bucket gets long, when the table needs to grow, when the keys are not comparable, when somebody decides to put a mutable object in there for reasons they do not remember.
The clever parts of
HashMapare not the algorithm, they are the defences.
That, for me, is what makes HashMap cool. It is a textbook data structure that has been quietly hardened over decades against the cases the textbook politely does not cover. Once you see it that way, “basically O(1)” stops being a hand-wave and starts looking like a promise the implementation has worked very hard to keep.
Worth opening the source the next time you take it for granted. The comments alone are worth the trip.