Introducing Aurora Pre-Alpha Preview: API to ClickHouse demo

Published Fri, January 17, 2025 ∙ AI, Educational, Announcements, Product ∙ by Johanan Ottensooser

⚠️

This is a pre-alpha release and may still be unstable. Please join our community to share feedback or email me directly.

Aurora is an AI tool created by Fiveonefour to assist with the data engineering tasks that a software engineer or a data product team might need to do.

Aurora leverages the open source Moose data engineering framework by Fiveonefour to reduce the surface area of the code that the AI has to write, increasing the performance, scalability, and editability of the code it creates.

Aurora is built on top of Meridian, a Rust AI framework written by Fiveonefour, to create and manage context and interact with LLM providers. All context is stored locally for now, but remote context management is coming soon.

Aurora is designed to deliver composable data engineering workflows:

  • Making data accessible [pre-alpha]
    • creating “connections” to a variety of sources (or assisting with data generation) and landing that data in an OLAP database [pre-alpha]
    • transforming that data into a data product [coming soon]
  • Consuming data [coming soon]
    • making that data accessible through a variety of egress methods
  • Modifying existing data flows [coming later]
  • Monitoring and maintaining data flows [coming later]

As well as managing complex combinations of all of these [coming later].

Demo

We’re releasing a pre-alpha of one of these workflows (ingest from an API, get data into ClickHouse) for feedback, specifically around developer ergonomics.

⚠️

This is a pre-alpha release and may still be unstable. Please join our community to share feedback or email me directly.

In this demo, you’ll be able to specify an arbitrary API, and Aurora will stand up the infrastructure, scripts and code necessary to get data from that API to ClickHouse.*

It will do so scalably, creating a script to fetch the data, a Rust webserver to ingest it, Redpanda topics to stream it, and ClickHouse tables to receive it, with quality guarantees derived from the inferred data model.

Coming soon, Aurora will create a cron scheduler for that script, so you can build a time series of data from the API. For now, you can build one manually per the Moose Docs; a sketch of one manual approach follows below.

*Most authenticated and unauthenticated APIs will work. Parameterized APIs will not work for now, but support is coming.
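Until then, one manual approach is a plain crontab entry that runs the ingestion script on a schedule. The project path, script name, and tsx runner below are all hypothetical stand-ins:

```sh
# Add to your crontab (crontab -e): run the ingestion script every 5 minutes.
*/5 * * * * cd /path/to/your-moose-project && npx tsx ingest.ts
```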

Set-up

Pre-requisites

You can run Aurora as long as you are on macOS or Linux. However, the workflow that we are previewing in this pre-alpha requires Moose, Fiveonefour’s open source data engineering framework, in order to run. Moose’s prerequisites are a recent version of Node.js (which provides npx) and Docker, which Moose uses to run its local infrastructure.
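A quick way to confirm those prerequisites are in place:

```sh
node --version    # Node.js is installed (npx ships with it)
docker --version  # Docker is installed and on your PATH
```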

Install Aurora

You can optionally specify a version of Aurora to install. If you don’t specify a version, the latest release will be installed.

Configure Aurora

Aurora’s configuration is set in a local configuration file. Here’s an example of a valid configuration file:
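A minimal sketch, assuming a TOML-style format. The key names are the ones described in the sections below, but the file location, exact format, and values shown are assumptions to check against your install:

```toml
# Add at least one LLM API key, matching your default provider.
# Values here are illustrative, not real keys.
openai_api_key = "sk-..."
# anthropic_api_key = "sk-ant-..."

# Required: the default provider. Optional: a default model, which
# must be offered by the default provider.
default_provider = "openai"
default_model = "gpt-4o"
```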

LLM API Keys

⚠️

You must add either your OpenAI API key (openai_api_key) or your Anthropic API key (anthropic_api_key) to the configuration file.

The API key you add must correspond to the provider/model you select as your default provider/model.

Default Provider and Model

You must select a default provider (default_provider). You can optionally select a default model (default_model); the default model must be offered by the default provider.

Walkthrough of pre-alpha demo → Getting data from an API into ClickHouse

This demo highlights how Aurora can help you with data engineering tasks: here, the simple task of getting data from an API* into ClickHouse. More features in this flow are coming shortly (see below), and additional flows are coming soon too.

Create a Moose Project

In the flow we are demonstrating, Aurora creates an ingestion pipeline from an API to ClickHouse using Fiveonefour’s open source data engineering framework, Moose. Accordingly, the last step in the set-up is creating a Moose project for Aurora to write into.

  1. moose-cli init test-project typescript --empty
     This creates a Moose project without example primitives, called test-project. Feel free to call the project whatever you like.
  2. cd test-project
  3. npm install installs the dependencies you need to run this project.
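Or, as one copy-pasteable sequence:

```sh
moose-cli init test-project typescript --empty  # scaffold an empty Moose project
cd test-project
npm install                                     # install project dependencies
```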

Start a session with Aurora

Start a session with Aurora:
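```sh
aurora start  # run from the root of your Moose project
```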

⚠️

As mentioned above, an Aurora session must be started from a Moose project. The initial checks will fail if it is started elsewhere.

  • Aurora will ask you to run certain checks to ensure that your environment is ready:

    • Checking that node is installed

    • Checking that npx is installed

    • Checking that the Moose CLI is installed

    • Checking that you are in a Moose project

    Aurora asks explicit permission for each of these steps, since it is running commands in your terminal on your behalf. (If you want to verify these preconditions manually, see the sketch after this list.)

  • Aurora will then ask your permission to watch Moose’s logs, storing them as additional context for debugging the primitives it creates.
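Verifying the preconditions manually amounts to roughly the following; the --version flag on the Moose CLI is an assumption, and any command confirming it is installed will do:

```sh
npx --version             # npx is available
npx moose-cli --version   # the Moose CLI resolves (flag assumed)
```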

Connecting to an API

You are now in a session with Aurora.

  • It will first ask you where you are getting data from. For this pre-alpha release, only APIs are available. Other sources are coming soon.

    The APIs available are standard APIs that are not parameterized. Parameterized APIs are coming soon.

    Authenticated APIs are supported, but only those that use bearer tokens and x-api-key header tokens. More authentication methods are coming soon.

  • The CLI will then ask for your URL.

    If you want a URL of an open API to test with, here’s a good example (you can preview its output with the curl snippet after this list): https://api.coindesk.com/v1/bpi/currentprice.json

  • If this is an authenticated API, it will ask for your token.
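To preview what that example endpoint returns before handing it to Aurora:

```sh
# Fetch the sample payload; pipe it through a JSON pretty-printer if you have one.
curl -s https://api.coindesk.com/v1/bpi/currentprice.json
```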

⚠️

Note: if you make a mistake here, exit out of the Aurora session (ctrl+c), clear the context (rm -rf /app/context), and start a new Aurora session (aurora start). A nicer workflow for this is coming soon.

Create rich context for the session

  • The CLI will then show you the sample data from that API. Confirm that it matches your expectations. Aurora will include the sample data in the context for the rest of the session.

  • The CLI will then ask for a documentation link, which is optional. Aurora will include the documentation in the context for the rest of the session.

    Just hit return to skip if there is no API documentation.

  • The CLI will then load Moose’s latest documentation into context.

⚠️

Note: if you make a mistake here, exit out of the Aurora session (ctrl+c), clear the context (rm -rf /app/context), and start a new Aurora session (aurora start). A nicer workflow for this is coming soon.

Aurora creates and validates generated data engineering primitives

  • The first “AI step” will be Aurora generating a name for a data model based on the sample data and documentation.

    The model knows, in this example, the subject matter of the data and that it is a “raw” data model (based on the source’s formatting), rather than a “clean” data model.

    You can optionally edit this name.

  • Aurora will then generate the Moose Data Model for the source. This will be used to spin up the data engineering infrastructure for this workflow: an ingest server, a Redpanda topic, and a ClickHouse table. (A sketch of what a generated data model can look like follows at the end of this section.)

    Aurora will then try the data model, and if there are errors (e.g. the <key> is not in the top-level node of the data model), it will use the errors in the logs to regenerate the data model.

  • Aurora will then ask if you would like it to create a script that pings the API and sends the data to the ingest server created in the previous step. It will then test the ingestion script.

    ⚠️

    Note: Aurora will only test that the script is valid, not that the script and data model work well together. Script integration testing and retry are coming soon.
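For orientation, here’s a minimal sketch of what a generated data model might look like in TypeScript, following Moose’s documented Key convention. The interface name and fields are hypothetical, loosely based on the example API above:

```ts
// A hypothetical "raw" data model for the example price API.
// Key marks the field Moose uses to key the underlying table and topic.
import { Key } from "@514labs/moose-lib";

export interface RawCurrentPrice {
  id: Key<string>;   // unique id per ingested record
  chartName: string; // fields inferred from the sample payload (hypothetical)
  updatedISO: string;
  usdRate: number;
}
```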

See what Aurora has generated

Open the project folder in your text editor of choice, and you’ll see that Aurora has generated:

  1. a data model
  2. an ingestion script
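And here’s a rough sketch of the shape of an ingestion script: fetch from the source API, then POST each record to the Moose ingest endpoint. The port and /ingest/<DataModel> path reflect Moose’s local defaults, but treat both, along with the field mapping, as assumptions to verify against your generated code:

```ts
// Hypothetical ingestion script: fetch a payload from the source API and
// forward it to the local Moose ingest server for the RawCurrentPrice model.
import { randomUUID } from "node:crypto";

const SOURCE_URL = "https://api.coindesk.com/v1/bpi/currentprice.json";
// Moose's local ingest endpoint; port and path are assumed defaults.
const INGEST_URL = "http://localhost:4000/ingest/RawCurrentPrice";

async function main() {
  const res = await fetch(SOURCE_URL);
  if (!res.ok) throw new Error(`source API returned ${res.status}`);
  const data: any = await res.json();

  // Flatten the payload into the data model's shape (fields are hypothetical).
  const record = {
    id: randomUUID(),
    chartName: data.chartName,
    updatedISO: data.time?.updatedISO,
    usdRate: data.bpi?.USD?.rate_float,
  };

  const ingest = await fetch(INGEST_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(record),
  });
  if (!ingest.ok) throw new Error(`ingest failed with ${ingest.status}`);
}

main().catch(console.error);
```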

Soon, Aurora will be able to generate:

  • cron jobs that run the ingestion script periodically
  • more data models, including mutating existing data models with natural language definitions
  • streaming functions that act on data flowing from one data model to another

Running the Moose server yourself

If you want to play with the Moose project:

  • Kill the Moose containers in Docker Desktop
  • Run moose dev
  • Run the script you created and watch the data flow into the ClickHouse instance!
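Concretely, that looks something like the following from the project root; the script name and tsx runner are hypothetical stand-ins for whatever Aurora generated:

```sh
moose dev          # restart the Moose dev server (after killing the containers)
npx tsx ingest.ts  # run the generated ingestion script in another terminal
```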

If you want to try again

⚠️

Aurora gathers and stores its context in a context folder (/app/context/). It looks for that context when running a workflow. If you want to run a workflow again, clear the context (rm -rf /app/context) and then run Aurora again (aurora start).

Coming soon to this workflow

  • More coverage for accepted API types
  • Other source types, like webhooks, databases, files, and file stores
  • Better, deterministic data model typing
  • Plain language data model mutation and streaming function definition

Other workflows coming soon

  • Database functions
  • Interactive egress definition
  • End-to-end project planning and meta-workflows