Designing a Scalable Database Schema for Reddit-like Comments - Part 1

Designing a Scalable Database Schema for Reddit-like Comments - Part 1

INTRODUCTION

Designing a database for a Reddit-like comment system presents unique challenges due to the hierarchical and potentially infinite nesting of comments. Efficiently storing, retrieving, and managing these comments requires a robust database schema that can handle complex queries and high traffic. In this comprehensive guide, we'll explore various methods to model hierarchical data in relational databases, delve into the pros and cons of each, and ultimately design an optimized schema suitable for a Reddit-style platform.Start writing today. Use the button below to create a Substack of your own

1. Understanding Reddit’s Comment Structure

Reddit’s comment system allows users to engage in discussions through nested comments, forming a tree-like structure. Key characteristics include:

  • Infinite Nesting: Comments can be nested to any depth.

  • High Read and Write Volume: Popular posts may have thousands of comments.

  • Dynamic Sorting: Comments can be sorted by new, top, controversial, etc.

  • User Interaction: Upvotes, downvotes, and user metadata are associated with comments.

2. Key Design Considerations

When designing the database schema, consider the following:

  • Read Efficiency: Ability to fetch a comment and its entire subtree efficiently.

  • Write Performance: Quick insertion of new comments without significant overhead.

  • Scalability: Handle growing data volume and user load.

  • Flexibility: Support various query patterns and sorting mechanisms.


3. Modeling Hierarchical Data

Several models exist to represent hierarchical data in relational databases. We’ll explore each method, discussing their advantages and disadvantages.

a. Adjacency List Model

Each record contains a reference to its parent, forming a simple parent-child relationship.

Schema:

CREATE TABLE Comment (
    id INT PRIMARY KEY AUTO_INCREMENT,
    post_id INT NOT NULL,
    parent_id INT NULL,
    content TEXT NOT NULL,
    author_id INT NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (parent_id) REFERENCES Comment(id)
);

Advantages:

  • Simple and easy to implement.

  • Efficient for direct parent or child queries.

Disadvantages:

  • Inefficient for retrieving entire subtrees.

  • Requires recursive queries or multiple joins.

b. Nested Sets Model

Uses left and right values to represent the hierarchy, allowing entire subtrees to be fetched efficiently.

Schema:

CREATE TABLE Comment (
    id INT PRIMARY KEY AUTO_INCREMENT,
    post_id INT NOT NULL,
    lft INT NOT NULL,
    rgt INT NOT NULL,
    content TEXT NOT NULL,
    author_id INT NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Example

Advantages:

  • Efficient subtree retrieval with a single query.

Disadvantages:

  • Complicated inserts and updates.

  • Requires recalculating left and right values for many nodes.

Explanation Video

c. Materialized Paths

Stores the path to each node as a string, simplifying certain queries.

Schema:

CREATE TABLE Comment (
    id INT PRIMARY KEY AUTO_INCREMENT,
    post_id INT NOT NULL,
    path VARCHAR(255) NOT NULL,
    content TEXT NOT NULL,
    author_id INT NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);

Example

Advantages:

  • Simplifies ancestor or descendant queries using string matching.

Disadvantages:

  • Path updates are required when moving nodes.

  • Limited by maximum path string length.

Explanation Video

d. Closure Table

Stores all ancestor-descendant relationships, allowing efficient traversal of the hierarchy.

Schema:

CREATE TABLE Comment (
    id INT PRIMARY KEY AUTO_INCREMENT,
    post_id INT NOT NULL,
    parent_id INT NULL,
    content TEXT NOT NULL,
    author_id INT NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (parent_id) REFERENCES Comment(id)
);

CREATE TABLE CommentClosure (
    ancestor_id INT NOT NULL,
    descendant_id INT NOT NULL,
    depth INT NOT NULL,
    PRIMARY KEY (ancestor_id, descendant_id),
    FOREIGN KEY (ancestor_id) REFERENCES Comment(id),
    FOREIGN KEY (descendant_id) REFERENCES Comment(id)
);

Advantages:

  • Efficient for querying ancestors and descendants.

  • Supports complex hierarchical queries.

Disadvantages:

  • Increased storage due to additional table.

  • Inserts and deletes are more complex.

Explanation Video


4. Choosing the Optimal Approach

Given the requirements of a Reddit-like comment system—particularly the need for efficient reads of entire comment threads—the Closure Table model emerges as the most suitable choice. It provides a good balance between read efficiency and manageable write complexity.


5. Designing the Database Schema

Here’s the finalized schema incorporating the Closure Table approach.

Comment Table:

CREATE TABLE Comment (
    id INT PRIMARY KEY AUTO_INCREMENT,
    post_id INT NOT NULL,
    parent_id INT NULL,
    content TEXT NOT NULL,
    author_id INT NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (post_id) REFERENCES Post(id),
    FOREIGN KEY (parent_id) REFERENCES Comment(id),
    FOREIGN KEY (author_id) REFERENCES User(id)
);

CommentClosure Table:

CREATE TABLE CommentClosure (
    ancestor_id INT NOT NULL,
    descendant_id INT NOT NULL,
    depth INT NOT NULL,
    PRIMARY KEY (ancestor_id, descendant_id),
    FOREIGN KEY (ancestor_id) REFERENCES Comment(id) ON DELETE CASCADE,
    FOREIGN KEY (descendant_id) REFERENCES Comment(id) ON DELETE CASCADE
);

Additional Tables:

CREATE TABLE Post (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    content TEXT NOT NULL,
    author_id INT NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (author_id) REFERENCES User(id)
);

CREATE TABLE User (
    id INT PRIMARY KEY AUTO_INCREMENT,
    username VARCHAR(50) NOT NULL UNIQUE,
    email VARCHAR(100) NOT NULL UNIQUE,
    password_hash VARCHAR(255) NOT NULL,
    created_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);


6. Implementing Comment Operations

Inserting Comments

When a new comment is added, we need to update both the Comment and CommentClosuretables.

Steps:

  1. Insert into Comment table.

  2. Insert into CommentClosure table:

    • Insert a row where the comment is both ancestor and descendant (depth = 0).

    • Insert rows for all ancestors of the parent comment.

SQL Example:

-- Step 1: Insert the comment
INSERT INTO Comment (post_id, parent_id, content, author_id)
VALUES (?, ?, ?, ?);

-- Assume the new comment's ID is LAST_INSERT_ID()

-- Step 2: Insert into CommentClosure
-- Insert self-reference
INSERT INTO CommentClosure (ancestor_id, descendant_id, depth)
VALUES (LAST_INSERT_ID(), LAST_INSERT_ID(), 0);

-- Insert ancestor paths
INSERT INTO CommentClosure (ancestor_id, descendant_id, depth)
    SELECT ancestor_id, LAST_INSERT_ID(), depth + 1
    FROM CommentClosure
    WHERE descendant_id = ?; -- parent_id

Retrieving Comment Trees

To display a comment thread, we need to fetch a comment and all its descendants.

SQL Query:

SELECT c.*, cc.depth
FROM CommentClosure cc
JOIN Comment c ON cc.descendant_id = c.id
WHERE cc.ancestor_id = ?
ORDER BY cc.depth ASC, c.created_at ASC;

Parameters:

  • Replace ? with the ID of the root comment or post.

7. Performance Considerations

  • Indexing: Ensure indexes are created on foreign keys and commonly queried columns.

  • Caching: Implement caching strategies for frequently accessed comment threads.

  • Pagination: Load comments in chunks to improve response times.

  • Denormalization: In some cases, denormalizing data can improve read performance at the expense of write complexity.


Soon Youtube Explanation Video will be added for this on our Channel - http://youtube.com/@subhahu


Stay tuned for more in-depth articles on Low Level Design, database design and system architecture at LLDCoding.com.

Thanks for reading LLDcoding’s Substack! Subscribe for free to receive new posts and support my work.

Did you find this article valuable?

Support Low Level Design (LLD) Coding by becoming a sponsor. Any amount is appreciated!