Goodreads book review dataset template

Published Thu, April 3, 2025 ∙ Templates, Product, AI ∙ by Johanan Ottensooser

My wife is an author, so of course, when looking for additional templates to add to the AI playground, the Goodreads database I stumbled across on Kaggle hit pretty hard!

Unlike the ADSB dataset, this is a single bulk ingestion from a website. However, I think it shows off a couple of cool things our advanced tools were able to do, including (1) figuring out how to use Kaggle's authentication and Python SDK, (2) turning the bulk ingest into a stream of individual book datapoints for the ingestion API, and (3) previewing the data for the user in the ingestion script.

Also, it is just super interesting to analyse book data!

Here's a sample of the data:

bookID: 1
title: Harry Potter and the Half-Blood Prince (Harry Potter #6)
authors: J.K. Rowling/Mary GrandPré
average_rating: 4.57
isbn: 0439785960
isbn13: 9780439785969
language_code: eng
num_pages: 652
ratings_count: 2095690
text_reviews_count: 27591
publication_date: 9/16/2006
publisher: Scholastic Inc.
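Each row like this becomes a single datapoint sent to the ingestion API. Here's a rough sketch of that record shape in Python (field names are taken from the sample above; the template's actual data model may differ):

from typing import TypedDict

class BookRecord(TypedDict):
    """One Goodreads book, mirroring the fields in the sample above (sketch only)."""
    bookID: int
    title: str
    authors: str             # multiple authors are "/"-separated
    average_rating: float
    isbn: str
    isbn13: str
    language_code: str
    num_pages: int
    ratings_count: int
    text_reviews_count: int
    publication_date: str    # e.g. "9/16/2006"
    publisher: str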

Here's how you get it running…

Requirements

Nice to haves

These will be required if you want to interrogate this data with our MCP tools:

Install Moose / Aurora

bash -i <(curl -fsSL https://fiveonefour.com/install.sh) moose,aurora

This will install Moose (our open source developer platform for data engineering) and Aurora (our AI data engineering product).

🔑 It will ask you for your Anthropic API key; if you don't have one yet, here's the setup guide.

Create a new project using the Goodreads template configured with Claude Desktop

aurora init books goodreads --mcp claude-desktop

This will use Aurora to initialize a new project called "books" from the "goodreads" template, while configuring Claude Desktop with the Aurora MCP tools pointed at this project.

Then, you'll need to run a few more commands to get things ready to go:

cd books
npm install

This will install the project's dependencies.

☸️ Make sure Docker Desktop is running before the next step!

moose dev

This will run the Moose local dev server, spinning up all your local data infrastructure including ClickHouse, Redpanda, Temporal, and our Rust ingest servers.

Add your Kaggle authentication key to the project's directory

Add your Kaggle authentication key to the root of the project directory. It should live at books/Kaggle Settings.json and have the structure:

{"username":"your_username","key":"your_key"

If you don't have one yet, you can get one here: https://www.kaggle.com/docs/api
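Under the hood, the ingest script uses Kaggle's Python SDK with these credentials to download the dataset. As a rough sketch of what that looks like (the dataset slug is an assumption; the template's script handles this for you):

# Sketch: authenticate with the Kaggle Python SDK and download the dataset.
# The kaggle package normally reads ~/.kaggle/kaggle.json or the
# KAGGLE_USERNAME / KAGGLE_KEY environment variables, so we export the values
# from the project's credentials file before importing kaggle.
import json, os

with open("Kaggle Settings.json") as f:     # the file you created above
    creds = json.load(f)
os.environ["KAGGLE_USERNAME"] = creds["username"]
os.environ["KAGGLE_KEY"] = creds["key"]

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()
# "jealousleopard/goodreadsbooks" is an assumption -- check the template's
# ingest script for the exact dataset slug it downloads.
api.dataset_download_files("jealousleopard/goodreadsbooks", path="data", unzip=True)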

Run the ingestion script

In a new terminal, navigate back to the project directory

cd path/to/books

You'll know you are in the correct directory if moose-config.toml is there.

Then run the Python ingest script!

python ingest_goodreads_data.py

This will grab a sample of the data, ask you to confirm it matches your expectations, then send batches of data to the data model's ingestion endpoint.
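Conceptually, the script is doing something like the following (a minimal sketch; the port, endpoint path, model name, and CSV file name are assumptions based on Moose's local dev defaults, so check the template's script for the real ones):

# Sketch: stream the bulk CSV into Moose as individual datapoints.
# Assumes Moose's local ingest server is on localhost:4000 and the data model
# is exposed at /ingest/<ModelName>; the real script also batches rows rather
# than posting them one at a time.
import csv
import requests

INGEST_URL = "http://localhost:4000/ingest/Book"   # hypothetical model name

with open("data/books.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        requests.post(INGEST_URL, json=row)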

If you go back to your original terminal running the Moose dev server, you'll see hundreds of incoming datapoints.

Explore your data in Claude

Ask Claude any questions you like!

We had fun with

What's the greatest correlate with book ratings? year? number of pages? whether the book was a sequel?

and

what are the type of books that I should invest in as a publisher if I want to maximize my return (the best investments being non-obvious books that are more affordable to acquire with higher sales).
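If you want to sanity-check the first question yourself, the raw correlations are easy to compute straight from the CSV with pandas (a quick sketch; the file path is an assumption from the download step):

# Sketch: which numeric columns correlate most strongly with average_rating?
import pandas as pd

df = pd.read_csv("data/books.csv", on_bad_lines="skip")
df.columns = df.columns.str.strip()   # some headers in the raw CSV may carry stray whitespace
df["publication_year"] = pd.to_datetime(
    df["publication_date"], format="%m/%d/%Y", errors="coerce"
).dt.year

numeric = df[["average_rating", "num_pages", "ratings_count",
              "text_reviews_count", "publication_year"]]
print(numeric.corr()["average_rating"].sort_values(ascending=False))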

Productionize your results with Cursor

First, let's configure Cursor with the Aurora MCP tool suite pointed at this project.

Navigate to the project directory and open Cursor

cd path/to/your/project
cursor .

Then, run the Aurora command to configure the MCP for Cursor

aurora setup --mcp cursor-project

This will create a .cursor/mcp.json file configured for Aurora's MCP; whenever this Cursor project is open, the MCP server will be started.

If you go to Cursor > Settings > Cursor Settings > MCP, you'll see the server.

Click enable and refresh, and you should be ready to go!

One line of questioning we liked here was

Create an API that takes in a year and returns the top 50 books for that year ranked by review (excluding any outliers with small numbers of reviews)
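Under the hood, that API boils down to a single ClickHouse query over the ingested table. Here's a minimal sketch of the logic in Python using clickhouse-connect (the host, port, credentials, database, table name, and 1,000-rating cutoff are all assumptions based on Moose's local dev defaults; the Aurora tools in Cursor will generate the actual Moose API for you):

# Sketch: top 50 books for a given year, ranked by average rating,
# excluding books with very few ratings (an arbitrary 1,000-rating floor).
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost", port=18123,            # assumed Moose local ClickHouse port
    username="panda", password="pandapass",  # assumed local dev credentials
    database="local",                        # assumed local dev database
)

def top_books(year: int, limit: int = 50):
    query = """
        SELECT title, authors, average_rating, ratings_count
        FROM Book                             -- hypothetical table name
        WHERE toYear(parseDateTimeBestEffortUSOrNull(publication_date)) = %(year)s
          AND ratings_count >= 1000
        ORDER BY average_rating DESC
        LIMIT %(limit)s
    """
    return client.query(query, parameters={"year": year, "limit": limit}).result_rows

print(top_books(2006))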
