Full-Stack Multithreaded Web Crawler Search Engine

Java 17 · Systems Engineering · USF · Aug–Dec 2023

A search engine built entirely from scratch in Java — featuring an in-memory inverted index, multithreaded web crawler with raw-socket HTTP client, TF-based relevance ranking, and an embedded Jetty web server with a custom dark-theme search UI branded "WebDepth."

Every component is hand-built: the HTTP fetcher uses raw sockets (not HttpURLConnection), the thread pool and read/write lock are custom implementations, and the JSON serializer writes its output without any third-party library. Developed through an iterative Git workflow with 10+ professor code reviews over one semester.

23 Java Files · 2,119 Lines of Code · 1,653 Javadoc Lines · Java 17

Architecture

Driver (CLI Entry)
  Parses flags: -text, -html, -query, -server, -threads
       │              │              │              │
       v              v              v              v
  Index Builder    Web Crawler    Search       Web Server
  Files.walk()   Raw sockets    Exact/Partial  Jetty + Servlet
  FileStemmer    HtmlFetcher    TF scoring     SearchServlet
       │              │              │              │
       v              v              v              v
  InvertedIndex / ThreadSafeInvertedIndex
  TreeMap<word, TreeMap<location, TreeSet<position>>>
       │                            │
  MultiReaderLock           WorkQueue
  N readers | 1 writer      Fixed thread pool + pending barrier

Concurrency Pattern:
  Thread-local InvertedIndex → mergeDistinct() → ThreadSafeInvertedIndex
  WorkQueue.finish() blocks until pending == 0

Code Highlights

Custom Read/Write Lock (MultiReaderLock)
public class MultiReaderLock {
    private final Object lock = new Object();  // private lock object (SEI CERT LCK00-J)
    private int readers, writers;
    private Thread activeWriter;

    private boolean isActiveWriter() {
        return activeWriter == Thread.currentThread();
    }

    public void readLock() {
        synchronized (lock) {
            boolean interrupted = false;
            while (writers > 0 && !isActiveWriter()) {
                try { lock.wait(); }
                catch (InterruptedException e) { interrupted = true; }
            }
            readers++;
            if (interrupted) Thread.currentThread().interrupt();  // restore status
        }
    }

    public void writeLock() {
        synchronized (lock) {
            boolean interrupted = false;
            while (!isActiveWriter() && (readers > 0 || writers > 0)) {
                try { lock.wait(); }
                catch (InterruptedException e) { interrupted = true; }
            }
            writers++;
            activeWriter = Thread.currentThread();  // recorded for reentrancy
            if (interrupted) Thread.currentThread().interrupt();  // restore status
        }
    }
}
Partial Search via TreeMap.tailMap()
public List<QueryResult> partialSearch(Set<String> queries) {
    HashMap<String, QueryResult> results = new HashMap<>();
    for (String query : queries) {
        for (var entry : index.tailMap(query).entrySet()) {
            if (!entry.getKey().startsWith(query)) break;  // sorted — stop early
            updateMatches(entry.getValue(), results);
        }
    }
    return results.values().stream().sorted().toList();  // score → count → loc
}
WorkQueue with Pending-Counter Barrier
public class WorkQueue {
    private final LinkedList<Runnable> queue = new LinkedList<>();
    private int pending;

    public void execute(Runnable task) {
        synchronized (this) { pending++; }
        synchronized (queue) {
            // wrap the task so pending is decremented even if it throws
            queue.addLast(() -> {
                try { task.run(); }
                finally {
                    synchronized (WorkQueue.this) {
                        pending--;
                        if (pending == 0) WorkQueue.this.notifyAll();
                    }
                }
            });
            queue.notifyAll();  // wake an idle worker
        }
    }

    public synchronized void finish() {
        boolean interrupted = false;
        while (pending > 0) {  // blocks until all tasks done
            try { wait(); }
            catch (InterruptedException e) { interrupted = true; }
        }
        if (interrupted) Thread.currentThread().interrupt();  // restore status
    }
}


How It Works

Inverted Index: The core data structure is a nested TreeMap<String, TreeMap<String, TreeSet<Integer>>> mapping stemmed words to document locations and word positions. A separate TreeMap<String, Integer> tracks total word counts per document for TF scoring. The thread-safe wrapper delegates all reads through readLock() and writes through writeLock(). Merging thread-local indexes uses computeIfAbsent() chains for lazy initialization.
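A minimal sketch of those computeIfAbsent() chains (class and method names here are illustrative, not the project's actual API):

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class IndexSketch {
    // word -> location -> sorted word positions, mirroring the nested TreeMap above
    private final TreeMap<String, TreeMap<String, TreeSet<Integer>>> index = new TreeMap<>();

    // computeIfAbsent() lazily creates the inner map and set on first touch
    public void add(String word, String location, int position) {
        index.computeIfAbsent(word, w -> new TreeMap<>())
             .computeIfAbsent(location, l -> new TreeSet<>())
             .add(position);
    }

    // Merging a thread-local index only touches the words that index saw
    public void addAll(IndexSketch local) {
        for (var wordEntry : local.index.entrySet()) {
            var locations = index.computeIfAbsent(wordEntry.getKey(), w -> new TreeMap<>());
            for (var locEntry : wordEntry.getValue().entrySet()) {
                locations.computeIfAbsent(locEntry.getKey(), l -> new TreeSet<>())
                         .addAll(locEntry.getValue());
            }
        }
    }

    public Set<Integer> positions(String word, String location) {
        return index.getOrDefault(word, new TreeMap<>())
                    .getOrDefault(location, new TreeSet<>());
    }
}
```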

Web Crawler: HtmlFetcher first sends an HTTP HEAD request to verify the content is text/html with status 200, then issues a GET for the body. HtmlCleaner strips block elements while preserving anchor tags for link extraction, then fully strips for indexing. LinkFinder normalizes URLs (fragment removal, path defaulting, encoding) and deduplicates via a HashSet. In multithreaded mode, the crawler maintains a BFS pendingQueue protected by a dedicated crawlLock.
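As one concrete piece of that pipeline, URL normalization along the lines described (fragment removal, path defaulting) might look like the sketch below; the class name and exact rules are assumptions, not the project's LinkFinder itself:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class LinkNormalizer {
    // Strips the fragment and defaults an empty path to "/", so equivalent
    // links hash identically when deduplicated in a HashSet.
    public static String normalize(String url) {
        try {
            URI uri = new URI(url);
            String path = (uri.getPath() == null || uri.getPath().isEmpty())
                    ? "/" : uri.getPath();
            // rebuilding without the fragment also re-encodes illegal characters
            return new URI(uri.getScheme(), uri.getAuthority(), path,
                    uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            return url;  // leave malformed links untouched; the caller may skip them
        }
    }
}
```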

Search & Ranking: The SearchProcessor uses the Strategy pattern: a Function<Set<String>, List<QueryResult>> reference switches between exact and partial search at construction. Results are ranked by TF score (matchCount / totalWords), then by raw match count descending, then by location alphabetically. The inner QueryResult class implements Comparable with this three-tier ordering.
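A sketch of the Strategy wiring and the three-tier Comparable; the names and record layout are assumptions (the project's QueryResult is an inner class rather than a record):

```java
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class RankingSketch {
    // Ranked by TF score descending, then match count descending, then location ascending
    record QueryResult(String location, int matchCount, int totalWords)
            implements Comparable<QueryResult> {
        double score() { return (double) matchCount / totalWords; }

        @Override
        public int compareTo(QueryResult other) {
            int cmp = Double.compare(other.score(), this.score());              // score desc
            if (cmp == 0) cmp = Integer.compare(other.matchCount, this.matchCount); // count desc
            if (cmp == 0) cmp = this.location.compareToIgnoreCase(other.location);  // location asc
            return cmp;
        }
    }

    // Strategy pattern: the search mode is bound once, at construction
    private final Function<Set<String>, List<QueryResult>> searchFunction;

    public RankingSketch(boolean partial) {
        this.searchFunction = partial ? this::partialSearch : this::exactSearch;
    }

    private List<QueryResult> exactSearch(Set<String> queries) { return List.of(); }   // stub
    private List<QueryResult> partialSearch(Set<String> queries) { return List.of(); } // stub
}
```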

Concurrency Model: The WorkQueue manages a fixed pool of daemon Worker threads pulling Runnable tasks from a LinkedList. Each task is wrapped in a lambda that auto-decrements a pending counter on completion, enabling finish() to block until all work is done. MultiReaderLock tracks an activeWriter thread reference for reentrancy — an active writer can re-acquire both read and write locks without deadlocking.
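To complement the WorkQueue highlight, here is a self-contained sketch that adds the daemon worker loop so the pending-counter barrier can actually run; the shape matches the description above, but names and details are assumptions:

```java
import java.util.LinkedList;

public class WorkQueueSketch {
    private final LinkedList<Runnable> queue = new LinkedList<>();
    private int pending;

    public WorkQueueSketch(int threads) {
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(this::workerLoop);
            worker.setDaemon(true);  // daemon: does not keep the JVM alive
            worker.start();
        }
    }

    private void workerLoop() {
        while (true) {
            Runnable task;
            synchronized (queue) {
                while (queue.isEmpty()) {
                    try { queue.wait(); }
                    catch (InterruptedException e) { return; }  // shut this worker down
                }
                task = queue.removeFirst();
            }
            task.run();  // run outside the lock so other workers can dequeue
        }
    }

    public void execute(Runnable task) {
        synchronized (this) { pending++; }
        synchronized (queue) {
            // the wrapper decrements pending even if the task throws
            queue.addLast(() -> {
                try { task.run(); }
                finally {
                    synchronized (this) {
                        pending--;
                        if (pending == 0) notifyAll();
                    }
                }
            });
            queue.notifyAll();  // wake an idle worker
        }
    }

    public synchronized void finish() {
        boolean interrupted = false;
        while (pending > 0) {
            try { wait(); }
            catch (InterruptedException e) { interrupted = true; }
        }
        if (interrupted) Thread.currentThread().interrupt();  // restore status
    }
}
```

The two-lock split mirrors the description: the queue monitor guards the task list and wakes idle workers, while the WorkQueue monitor guards only the pending counter that finish() waits on.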

Tech Stack

Language: Java 17 (records, sealed classes)
Build: Apache Maven
NLP: Apache OpenNLP (Snowball Stemmer)
Logging: Apache Log4j2
Web Server: Eclipse Jetty 11 + Jakarta Servlets
Testing: JUnit 5
Security: Apache Commons Text (XSS escaping)
HTTP: Raw sockets + SSLSocketFactory
Java · Multithreading · Inverted Index · Web Crawling · Raw Sockets · Jetty · TF Ranking · Read/Write Locks · Servlets · Custom Thread Pool