Skip to main content

Set summary

Java Set Deep Dive - Official Documentation

The Set Architecture

Official Definition: "A collection that contains no duplicate elements. More formally, sets contain no pair of elements e1 and e2 such that e1.equals(e2)." — java.util.Set

1. The Complete Hierarchy

Before understanding the backend, you must understand the architectural flow. The Set interface inherits directly from Collection.

Iterable
Collection
Set
HashSet
LinkedHashSet
SortedSet
NavigableSet
TreeSet

2. How Set Prevents Duplicates

This is a critical backend concept. A Set does not write manual for loops to check if an element exists. Instead, the Set interface internally relies on Map implementations.

Uniqueness is guaranteed strictly by utilizing two core Object methods:

  • hashCode(): Finds the memory bucket.
  • equals(): Checks if the object exactly matches existing data in that bucket.
Set<Integer> set = new HashSet<>(); set.add(10); set.add(10); // Evaluates to duplicate, rejected silently System.out.println(set); // Output: [10]

HashSet Deep Dive

Official Definition: "This class implements the Set interface, backed by a hash table (actually a HashMap instance). It makes no guarantees as to the iteration order of the set." — java.util.HashSet

1. The Internal Backend Reality

If you look at the raw Java source code for HashSet, you will discover that it doesn't store data by itself. It is merely a wrapper around a HashMap.

// Inside java.util.HashSet private transient HashMap<E,Object> map; // The Hidden Dummy Object private static final Object PRESENT = new Object();

Because a Map requires both a Key and a Value, the HashSet stores your actual data as the Key. It uses a static, empty dummy object called PRESENT to fill the Value requirement.

2. Backend Flow of add()

When you call set.add("Java"), this is the exact logic executed:

public boolean add(E e) { return map.put(e, PRESENT) == null; }

If map.put() returns null, it means the key didn't exist, and the insertion was successful. If it returns an object (the PRESENT object), it means the key was a duplicate, and the Set rejects it by returning false.

3. The Bucket Array & Index Formula

The backing HashMap uses an array of buckets. Java calculates exactly where to put your data using a bitwise formula:

index = (n - 1) & hash

Where n is the table size. This bitwise operation is significantly faster than standard modulo (%) arithmetic.

4. Collision Handling & Treeification (Java 8)

If two distinct keys generate the same hash and land in the same bucket, a Collision occurs.

  • Java 7: Handled collisions using a standard Linked List ($O(n)$ search time).
  • Java 8+: If a single bucket exceeds 8 elements (and total table size $\ge 64$), the bucket's linked list violently converts into a Red-Black Tree. This drops search complexity from $O(n)$ down to a highly optimized $O(\log n)$.

5. The Mutable Key Danger

Never use mutable objects (objects whose data can change after creation) as elements in a HashSet. If you insert a Student object, and later change the Student's ID, its hashCode() will change. The Set will never be able to find it in the original bucket, creating a fatal memory leak and broken lookups.

LinkedHashSet Backend

Official Definition: "Hash table and linked list implementation of the Set interface, with predictable iteration order. This implementation differs from HashSet in that it maintains a doubly-linked list running through all of its entries." — java.util.LinkedHashSet

1. Why was it created?

A standard HashSet stores items chaotically based on memory hashes. If you insert 10, 50, and 20, they might print out as 50, 10, 20. LinkedHashSet was engineered specifically to solve this by explicitly preserving Insertion Order.

2. Internal Data Structure

Just as a HashSet wraps a HashMap, a LinkedHashSet wraps a LinkedHashMap. It uses a hybrid architecture:

  • Hash Table: For extremely fast $O(1)$ lookups.
  • Doubly Linked List: To string all the entries together in the exact order they were inserted.
// Internal Entry Structure inherited from LinkedHashMap static class Entry<K,V> extends HashMap.Node<K,V> { Entry<K,V> before; Entry<K,V> after; }

3. Memory vs. Performance Tradeoff

The time complexities for add(), remove(), and contains() remain $O(1)$. However, it requires significantly more memory because every single node must store two additional memory pointers (before and after).

Real-world use case: Caching recent search history where uniqueness and chronological order are equally vital.

TreeSet & Red-Black Trees

Official Definition: "A NavigableSet implementation based on a TreeMap. The elements are ordered using their natural ordering, or by a Comparator provided at set creation time." — java.util.TreeSet

1. The Internal Backend

TreeSet completely abandons Hash Tables. It does not use hashCode() or buckets. Internally, it is backed by a TreeMap, which utilizes a Red-Black Tree—a self-balancing binary search tree.

// Internal Node Structure static final class Entry<K,V> { K key; Entry<K,V> left; Entry<K,V> right; Entry<K,V> parent; boolean color; // RED or BLACK }

2. The Balancing Act

If you insert sequential data (e.g., 10, 20, 30, 40), a standard binary tree would devolve into a linked list, degrading search time to $O(n)$.

The Red-Black Tree enforces strict mathematical coloring rules. If the tree becomes top-heavy, it automatically executes Rotations and Recoloring to ensure the maximum depth is minimized. This guarantees that add(), remove(), and contains() strictly operate in $O(\log n)$ time.

3. The Null Restriction

You cannot add null to a TreeSet. Because it relies heavily on comparing elements (e.g., element.compareTo(target)), attempting to compare a null reference immediately triggers a NullPointerException.

Summary & Interview Guide

1. The Ultimate Comparison

Feature HashSet LinkedHashSet TreeSet
Internal Backend HashMap LinkedHashMap TreeMap (Red-Black Tree)
Element Ordering Chaotic / None Insertion Order Sorted (Natural/Comparator)
Null Elements Yes (1 allowed) Yes (1 allowed) No (Throws NPE)
Speed (add/search) $O(1)$ $O(1)$ $O(\log n)$
Memory Footprint Medium High (Extra Pointers) Highest (Tree Pointers + Colors)

2. Important Interview Questions

  • Q: What is the backend of HashSet?
    A: A HashMap. It stores the actual elements as the keys, and uses a dummy static object named PRESENT for the values.
  • Q: Why is TreeSet slower than HashSet?
    A: Because TreeSet must perform mathematical comparisons, node recoloring, and physical tree rotations upon insertion to maintain a perfectly balanced Red-Black Tree ($O(\log n)$), whereas HashSet just computes a hash and drops the item in a bucket ($O(1)$).
  • Q: Why do equals() and hashCode() matter for Sets?
    A: For hash-based sets, hashCode() determines the memory bucket. If two objects land in the same bucket, equals() is executed to confirm if they are truly identical to prevent duplicates.

Comments

Popular posts from this blog

How I Got Selected in MNC

Virtusa Sometimes success does not come from having the best coding skills or the perfect roadmap. Sometimes it comes from simply refusing to quit. This is the honest story of how I transitioned from a confused, rejected fresher to getting selected as an Associate Engineer at Virtusa. The Beginning: Confused About My Future After completing my graduation, I stared blankly at my career options. Like many freshers, I lacked a clear direction. Should I join a Java course? Should I prepare on my own? Should I just wait for campus placement opportunities? One day, I called my friend Chetan. He suggested I join Naresh i Technologies and start learning Java seriously. Still unsure of my path, I told him I needed time to think about it. A couple of days later, my phone buzzed with a WhatsApp message offering a job opportunity. They asked me to come for the next round of the recruitment process. Excitement completely took over. I packed my bags, traveled to th...

Spring Boot Introduction

Spring Boot Introduction: Architecture, Dependencies, and Embedded Servers Modern enterprise applications demand rapid development, frictionless deployment, and absolute minimal configuration. Before Spring Boot arrived, developers utilizing the Spring Framework wasted immense amounts of time configuring XML files, managing clashing dependencies, setting up clunky application servers, and stitching various Spring modules together manually. To eliminate these bottlenecks, Pivotal introduced Spring Boot . Built entirely on top of the traditional Spring Framework, Spring Boot is an "opinionated" framework. It aggressively simplifies application development by injecting auto-configuration, packaging starter dependencies, and embedding web servers directly into your application. This allows backend developers to focus entirely on building business logic rather than wrestling with infrastructure setup. What is Spring Boot? Spring Boot is a powerful extens...

Strings in C

C Programming: Working with Strings Unlike modern programming languages like Python or Java, C does not possess a dedicated "String" data type. Instead, C treats a string as a simple 1D array of characters. Real-life example: Think of a freight train. The train does not exist as one solid object; it consists of individual boxcars linked together. Similarly, a C string links individual characters side-by-side in memory. To let the computer know the train has ended, C attaches a special "caboose" called the Null Terminator ( \0 ). 1. Essential String Functions Handling strings manually requires complex loops. To save time, C provides a built-in library called <string.h> that contains powerful functions to manipulate text. strlen(): You use this to find the exact length of a string. The compiler counts the characters until it hits the \0 terminator. It does not count the terminator itself. strcpy(): You ...