A search engine built entirely from scratch in Java — featuring an in-memory inverted index, multithreaded web crawler with raw-socket HTTP client, TF-based relevance ranking, and an embedded Jetty web server with a custom dark-theme search UI branded "WebDepth."
Every component is hand-built: the HTTP fetcher uses raw sockets (not HttpURLConnection), the thread pool and read/write lock are custom implementations, and the JSON serializer is written without any third-party library. Developed through an iterative Git workflow with 10+ professor code reviews over one semester.
```
                        Driver (CLI Entry)
       Parses flags: -text, -html, -query, -server, -threads
          │              │              │             │
          v              v              v             v
   Index Builder    Web Crawler      Search       Web Server
   Files.walk()     Raw sockets    Exact/Partial  Jetty + Servlet
   FileStemmer      HtmlFetcher    TF scoring     SearchServlet
          │              │              │             │
          v              v              v             v
          InvertedIndex / ThreadSafeInvertedIndex
      TreeMap<word, TreeMap<location, TreeSet<position>>>
               │                          │
        MultiReaderLock               WorkQueue
     N readers | 1 writer    Fixed thread pool + pending barrier

Concurrency Pattern:
  Thread-local InvertedIndex → mergeDistinct() → ThreadSafeInvertedIndex
  WorkQueue.finish() blocks until pending == 0
```
Key components:

- Inverted Index: TreeMap structure mapping words → document locations → positions, with a separate word-count map for TF scoring. The thread-safe variant wraps all access with a custom read/write lock.
- HTTP Fetcher: raw Socket and SSLSocketFactory. A two-step HEAD-then-GET strategy avoids downloading non-HTML content. Follows up to 3 redirects.
- Crawler: BFS pendingQueue. Each URL is a WorkQueue task that builds a thread-local index, then merges atomically into the shared index via mergeDistinct().
- Search: exact search via HashMap lookup; partial (prefix) search via TreeMap.tailMap(). Results ranked by TF score → match count → location.
- Concurrency: WorkQueue thread pool with pending-counter barrier, and MultiReaderLock with reentrant writer following SEI CERT LCK00-J (private final lock object).
- Output safety: queries are escaped with escapeHtml4() before rendering.

```java
public class MultiReaderLock {
    private final Object lock = new Object(); // SEI CERT LCK00-J: private final lock object
    private int readers, writers;
    private Thread activeWriter;

    private boolean isActiveWriter() {
        return activeWriter == Thread.currentThread();
    }

    public void readLock() {
        synchronized (lock) {
            while (writers > 0 && !isActiveWriter()) {
                try { lock.wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
            readers++;
        }
    }

    public void writeLock() {
        synchronized (lock) {
            while (!isActiveWriter() && (readers > 0 || writers > 0)) {
                try { lock.wait(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            }
            writers++;
            activeWriter = Thread.currentThread(); // reentrant: the active writer may re-enter
        }
    }
    // readUnlock()/writeUnlock() decrement the counts, clear activeWriter when
    // the last write lock is released, and lock.notifyAll() waiters (omitted here)
}
```
```java
public List<QueryResult> partialSearch(Set<String> queries) {
    HashMap<String, QueryResult> results = new HashMap<>();
    for (String query : queries) {
        // tailMap(query) jumps to the first key >= query; keys are sorted,
        // so iteration stops at the first key that no longer matches.
        for (var entry : index.tailMap(query).entrySet()) {
            if (!entry.getKey().startsWith(query)) break; // sorted — stop early
            updateMatches(entry.getValue(), results);     // accumulate per-location matches
        }
    }
    return results.values().stream().sorted().toList();   // score → count → location
}
```
```java
public class WorkQueue {
    private final LinkedList<Runnable> queue = new LinkedList<>();
    private int pending;

    public void execute(Runnable task) {
        synchronized (this) { pending++; }
        synchronized (queue) {
            queue.addLast(() -> {
                try {
                    task.run();
                } finally {
                    synchronized (WorkQueue.this) {
                        pending--;
                        if (pending == 0) WorkQueue.this.notifyAll();
                    }
                }
            });
            queue.notifyAll(); // wake an idle Worker thread
        }
    }

    public synchronized void finish() throws InterruptedException {
        while (pending > 0) wait(); // blocks until all tasks done
    }
}
```
Design decisions:

- TreeMap over HashMap: TreeMap maintains sorted key order, enabling O(log n) prefix search via tailMap() with early termination. A HashMap would require scanning all keys for partial matches.
- Thread-local indexing: each crawl task builds its own InvertedIndex, then merges into the shared ThreadSafeInvertedIndex via mergeDistinct(). This minimizes lock contention compared to per-word locking.
- Custom lock: building MultiReaderLock from scratch with a private final lock object follows SEI CERT LCK00-J, provides reentrant write semantics, and avoids lock-ordering deadlocks in the merge path.
- Pending counter over CountDownLatch: WorkQueue.finish() uses an incrementing/decrementing counter instead of CountDownLatch because the total number of tasks is unknown at submission time (crawler discovers URLs dynamically).
- Custom serialization: JsonWriter supports nested structures with 8-decimal precision for ranking scores.
- Statistics: counters (AtomicInteger) track indexed pages and indexed words.
- Index export: the serialized index is served as an application/octet-stream response.

Inverted Index: The core data structure is a nested TreeMap<String, TreeMap<String, TreeSet<Integer>>> mapping stemmed words to document locations and word positions. A separate TreeMap<String, Integer> tracks total word counts per document for TF scoring. The thread-safe wrapper delegates all reads through readLock() and writes through writeLock(). Merging thread-local indexes uses computeIfAbsent() chains for lazy initialization.
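The nested structure and computeIfAbsent() chain can be sketched as follows; the class and method names here (IndexSketch, add, countAt, contains) are illustrative assumptions, not the project's exact API:

```java
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch of the nested word → location → positions index described above,
// with the separate per-document word-count map used for TF scoring.
public class IndexSketch {
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();
    private final TreeMap<String, Integer> wordCounts = new TreeMap<>();

    public void add(String word, String location, int position) {
        // computeIfAbsent lazily initializes each nesting level
        index.computeIfAbsent(word, w -> new TreeMap<>())
             .computeIfAbsent(location, l -> new TreeSet<>())
             .add(position);
        wordCounts.merge(location, 1, Integer::sum);
    }

    public int countAt(String location) {
        return wordCounts.getOrDefault(location, 0);
    }

    public boolean contains(String word, String location, int position) {
        var locations = index.get(word);
        return locations != null
            && locations.containsKey(location)
            && locations.get(location).contains(position);
    }
}
```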
Web Crawler: HtmlFetcher first sends an HTTP HEAD request to verify the content is text/html with status 200, then issues a GET for the body. HtmlCleaner strips block elements while preserving anchor tags for link extraction, then fully strips for indexing. LinkFinder normalizes URLs (fragment removal, path defaulting, encoding) and deduplicates via a HashSet. In multithreaded mode, the crawler maintains a BFS pendingQueue protected by a dedicated crawlLock.
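The normalization step can be sketched with java.net.URI; this is a hedged illustration of fragment removal and path defaulting under assumed behavior, not the project's actual LinkFinder code:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UrlNormalizer {
    // Resolve href against the page's base URL, strip the fragment, and
    // default an empty path to "/" so duplicate forms of the same page
    // collapse to one key in the crawler's visited HashSet.
    public static String normalize(String base, String href) {
        try {
            URI resolved = new URI(base).resolve(href);
            String path = (resolved.getPath() == null || resolved.getPath().isEmpty())
                    ? "/" : resolved.getPath();
            return new URI(resolved.getScheme(), resolved.getAuthority(),
                           path, resolved.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```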
Search & Ranking: The SearchProcessor uses the Strategy pattern: a Function<Set<String>, List<QueryResult>> reference switches between exact and partial search at construction. Results are ranked by TF score (matchCount / totalWords), then by raw match count descending, then by location alphabetically. The inner QueryResult class implements Comparable with this three-tier ordering.
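The three-tier ordering might look like the following sketch; the field names are assumptions based on the description above, not the project's exact code:

```java
public class QueryResult implements Comparable<QueryResult> {
    final String location;
    final int matchCount;
    final int totalWords;

    QueryResult(String location, int matchCount, int totalWords) {
        this.location = location;
        this.matchCount = matchCount;
        this.totalWords = totalWords;
    }

    double score() { return (double) matchCount / totalWords; } // TF score

    @Override
    public int compareTo(QueryResult other) {
        // 1) higher TF score first, 2) higher raw match count first,
        // 3) location alphabetically as the final tiebreaker
        int byScore = Double.compare(other.score(), score());
        if (byScore != 0) return byScore;
        int byCount = Integer.compare(other.matchCount, matchCount);
        if (byCount != 0) return byCount;
        return location.compareToIgnoreCase(other.location);
    }
}
```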
Concurrency Model: The WorkQueue manages a fixed pool of daemon Worker threads pulling Runnable tasks from a LinkedList. Each task is wrapped in a lambda that auto-decrements a pending counter on completion, enabling finish() to block until all work is done. MultiReaderLock tracks an activeWriter thread reference for reentrancy — an active writer can re-acquire both read and write locks without deadlocking.
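A minimal sketch of this worker-loop-plus-pending-counter pattern is below; MiniPool is a hypothetical name, and the real WorkQueue/Worker classes may differ in detail:

```java
import java.util.LinkedList;

public class MiniPool {
    private final LinkedList<Runnable> queue = new LinkedList<>();
    private int pending; // guarded by this; counts submitted-but-unfinished tasks

    public MiniPool(int threads) {
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(this::workerLoop);
            worker.setDaemon(true); // daemon: the JVM may exit while workers idle
            worker.start();
        }
    }

    private void workerLoop() {
        while (true) {
            Runnable task;
            synchronized (queue) {
                while (queue.isEmpty()) {
                    try { queue.wait(); }
                    catch (InterruptedException e) { return; }
                }
                task = queue.removeFirst();
            }
            try { task.run(); } // run outside the lock so other workers can dequeue
            finally {
                synchronized (this) { if (--pending == 0) notifyAll(); }
            }
        }
    }

    public void execute(Runnable task) {
        synchronized (this) { pending++; }
        synchronized (queue) { queue.addLast(task); queue.notifyAll(); }
    }

    public synchronized void finish() {
        while (pending > 0) {
            try { wait(); }
            catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
        }
    }
}
```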