Full-Stack Multithreaded Web Crawler Search Engine

Java 17 · Systems Engineering · USF · Aug–Dec 2023

A search engine built entirely from scratch in Java — featuring an in-memory inverted index, multithreaded web crawler with raw-socket HTTP client, TF-based relevance ranking, and an embedded Jetty web server with a custom dark-theme search UI branded "WebDepth."

Every component is hand-built: the HTTP fetcher uses raw sockets (not HttpURLConnection), the thread pool and read/write lock are custom implementations, and the JSON serializer writes its output without any third-party library. Developed through an iterative Git workflow with 10+ professor code reviews over one semester.

23 Java Files · 2,119 Lines of Code · 1,653 Javadoc Lines · Java 17

Architecture

Driver (CLI Entry)
  Parses flags: -text, -html, -query, -server, -threads
       │              │              │              │
       v              v              v              v
  Index Builder    Web Crawler    Search       Web Server
  Files.walk()   Raw sockets    Exact/Partial  Jetty + Servlet
  FileStemmer    HtmlFetcher    TF scoring     SearchServlet
       │              │              │              │
       v              v              v              v
  InvertedIndex / ThreadSafeInvertedIndex
  TreeMap<word, TreeMap<location, TreeSet<position>>>
       │                            │
  MultiReaderLock           WorkQueue
  N readers | 1 writer      Fixed thread pool + pending barrier

Concurrency Pattern:
  Thread-local InvertedIndex → mergeDistinct() → ThreadSafeInvertedIndex
  WorkQueue.finish() blocks until pending == 0

Code Highlights

Custom Read/Write Lock (MultiReaderLock)
public class MultiReaderLock {
    private final Object lock = new Object();  // private lock object (SEI CERT LCK00-J)
    private int readers, writers;
    private Thread activeWriter;

    private boolean isActiveWriter() {
        return activeWriter == Thread.currentThread();
    }

    public void readLock() {
        synchronized (lock) {
            boolean interrupted = false;
            while (writers > 0 && !isActiveWriter()) {
                try { lock.wait(); }
                catch (InterruptedException e) { interrupted = true; }
            }
            readers++;
            if (interrupted) Thread.currentThread().interrupt();  // restore status
        }
    }

    public void writeLock() {
        synchronized (lock) {
            boolean interrupted = false;
            while (!isActiveWriter() && (readers > 0 || writers > 0)) {
                try { lock.wait(); }
                catch (InterruptedException e) { interrupted = true; }
            }
            writers++;
            activeWriter = Thread.currentThread();  // recorded for reentrancy
            if (interrupted) Thread.currentThread().interrupt();  // restore status
        }
    }
}
Partial Search via TreeMap.tailMap()
public List<QueryResult> partialSearch(Set<String> queries) {
    HashMap<String, QueryResult> results = new HashMap<>();
    for (String query : queries) {
        for (var entry : index.tailMap(query).entrySet()) {
            if (!entry.getKey().startsWith(query)) break;  // sorted — stop early
            updateMatches(entry.getValue(), results);
        }
    }
    return results.values().stream().sorted().toList();  // score → count → loc
}
WorkQueue with Pending-Counter Barrier
public class WorkQueue {
    private final LinkedList<Runnable> queue = new LinkedList<>();
    private int pending;

    public void execute(Runnable task) {
        synchronized (this) { pending++; }
        synchronized (queue) {
            // wrap the task so pending is decremented even if it throws
            queue.addLast(() -> {
                try { task.run(); }
                finally {
                    synchronized (WorkQueue.this) {
                        pending--;
                        if (pending == 0) WorkQueue.this.notifyAll();
                    }
                }
            });
            queue.notifyAll();  // wake an idle worker
        }
    }

    public synchronized void finish() {
        boolean interrupted = false;
        while (pending > 0) {  // blocks until all tasks done
            try { wait(); }
            catch (InterruptedException e) { interrupted = true; }
        }
        if (interrupted) Thread.currentThread().interrupt();  // restore status
    }
}


How It Works

Inverted Index: The core data structure is a nested TreeMap<String, TreeMap<String, TreeSet<Integer>>> mapping stemmed words to document locations and word positions. A separate TreeMap<String, Integer> tracks total word counts per document for TF scoring. The thread-safe wrapper delegates all reads through readLock() and writes through writeLock(). Merging thread-local indexes uses computeIfAbsent() chains for lazy initialization.
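A minimal sketch of those computeIfAbsent() chains (class and method names here are illustrative, not the project's actual API):

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class IndexSketch {
    // word -> location -> sorted word positions, mirroring the nested TreeMap above
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    // computeIfAbsent() lazily creates the inner map and set on first touch
    public void add(String word, String location, int position) {
        index.computeIfAbsent(word, w -> new TreeMap<>())
             .computeIfAbsent(location, l -> new TreeSet<>())
             .add(position);
    }

    // Merging a thread-local index only touches the words that index saw
    public void addAll(IndexSketch local) {
        for (var wordEntry : local.index.entrySet()) {
            var locations = index.computeIfAbsent(wordEntry.getKey(), w -> new TreeMap<>());
            for (var locEntry : wordEntry.getValue().entrySet()) {
                locations.computeIfAbsent(locEntry.getKey(), l -> new TreeSet<>())
                         .addAll(locEntry.getValue());
            }
        }
    }

    public Set<Integer> positions(String word, String location) {
        return index.getOrDefault(word, new TreeMap<>())
                    .getOrDefault(location, new TreeSet<>());
    }
}
```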

Web Crawler: HtmlFetcher first sends an HTTP HEAD request to verify the content is text/html with status 200, then issues a GET for the body. HtmlCleaner strips block elements while preserving anchor tags for link extraction, then fully strips for indexing. LinkFinder normalizes URLs (fragment removal, path defaulting, encoding) and deduplicates via a HashSet. In multithreaded mode, the crawler maintains a BFS pendingQueue protected by a dedicated crawlLock.
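As one concrete piece of that pipeline, URL normalization along the lines described (fragment removal, path defaulting) might look like the sketch below; the class name and exact rules are assumptions, not the project's LinkFinder itself:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class LinkNormalizer {
    // Strips the fragment and defaults an empty path to "/", so equivalent
    // links hash identically when deduplicated in a HashSet.
    public static String normalize(String url) {
        try {
            URI uri = new URI(url);
            String path = (uri.getPath() == null || uri.getPath().isEmpty())
                    ? "/" : uri.getPath();
            // rebuilding without the fragment also re-encodes illegal characters
            return new URI(uri.getScheme(), uri.getAuthority(), path,
                    uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return url;  // leave malformed links untouched; the caller may skip them
        }
    }
}
```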

Search & Ranking: The SearchProcessor uses the Strategy pattern: a Function<Set<String>, List<QueryResult>> reference switches between exact and partial search at construction. Results are ranked by TF score (matchCount / totalWords), then by raw match count descending, then by location alphabetically. The inner QueryResult class implements Comparable with this three-tier ordering.
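A sketch of the Strategy wiring and the three-tier Comparable; the names and record layout are assumptions (the project's QueryResult is an inner class rather than a record):

```java
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class RankingSketch {
    // Ranked by TF score descending, then match count descending, then location ascending
    record QueryResult(String location, int matchCount, int totalWords)
            implements Comparable<QueryResult> {
        double score() { return (double) matchCount / totalWords; }

        @Override
        public int compareTo(QueryResult other) {
            int cmp = Double.compare(other.score(), this.score());              // score desc
            if (cmp == 0) cmp = Integer.compare(other.matchCount, this.matchCount); // count desc
            if (cmp == 0) cmp = this.location.compareToIgnoreCase(other.location);  // location asc
            return cmp;
        }
    }

    // Strategy pattern: the search mode is bound once, at construction
    private final Function<Set<String>, List<QueryResult>> searchFunction;

    public RankingSketch(boolean partial) {
        this.searchFunction = partial ? this::partialSearch : this::exactSearch;
    }

    private List<QueryResult> exactSearch(Set<String> queries) { return List.of(); }   // stub
    private List<QueryResult> partialSearch(Set<String> queries) { return List.of(); } // stub
}
```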

Concurrency Model: The WorkQueue manages a fixed pool of daemon Worker threads pulling Runnable tasks from a LinkedList. Each task is wrapped in a lambda that auto-decrements a pending counter on completion, enabling finish() to block until all work is done. MultiReaderLock tracks an activeWriter thread reference for reentrancy — an active writer can re-acquire both read and write locks without deadlocking.
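To complement the WorkQueue highlight, here is a self-contained sketch that adds the daemon worker loop so the pending-counter barrier can actually run; the shape matches the description above, but names and details are assumptions:

```java
import java.util.LinkedList;

public class WorkQueueSketch {
    private final LinkedList<Runnable> queue = new LinkedList<>();
    private int pending;

    public WorkQueueSketch(int threads) {
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(this::workerLoop);
            worker.setDaemon(true);  // daemon: does not keep the JVM alive
            worker.start();
        }
    }

    private void workerLoop() {
        while (true) {
            Runnable task;
            synchronized (queue) {
                while (queue.isEmpty()) {
                    try { queue.wait(); }
                    catch (InterruptedException e) { return; }  // shut this worker down
                }
                task = queue.removeFirst();
            }
            task.run();  // run outside the lock so other workers can dequeue
        }
    }

    public void execute(Runnable task) {
        synchronized (this) { pending++; }
        synchronized (queue) {
            // the wrapper decrements pending even if the task throws
            queue.addLast(() -> {
                try { task.run(); }
                finally {
                    synchronized (this) {
                        pending--;
                        if (pending == 0) notifyAll();
                    }
                }
            });
            queue.notifyAll();  // wake an idle worker
        }
    }

    public synchronized void finish() {
        boolean interrupted = false;
        while (pending > 0) {
            try { wait(); }
            catch (InterruptedException e) { interrupted = true; }
        }
        if (interrupted) Thread.currentThread().interrupt();  // restore status
    }
}
```

The two-lock split mirrors the description: the queue monitor guards the task list and wakes idle workers, while the WorkQueue monitor guards only the pending counter that finish() waits on.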

Tech Stack

Language: Java 17 (records, sealed classes)
Build: Apache Maven
NLP: Apache OpenNLP (Snowball Stemmer)
Logging: Apache Log4j2
Web Server: Eclipse Jetty 11 + Jakarta Servlets
Testing: JUnit 5
Security: Apache Commons Text (XSS escaping)
HTTP: Raw sockets + SSLSocketFactory
Java · Multithreading · Inverted Index · Web Crawling · Raw Sockets · Jetty · TF Ranking · Read/Write Locks · Servlets · Custom Thread Pool