NGC-04 · Deep Field

All Observation Reports

Deep-field studies — zooming in on individual objects to examine methods, constraints, and outcomes.

NGC-OBS-001

A Data-Driven Exploration of Food

⊹ The Problem

Food waste and economic inflation are reshaping how people cook — but recipe generation tools in 2023 were blunt instruments, pattern-matching on popularity rather than understanding why ingredients work together. The question I wanted to answer: could AI learn the underlying relationships between ingredients well enough to suggest combinations nobody had thought of yet?

⊹ The Approach

I built a knowledge graph of ingredient relationships in Neo4j, pulling recipe data from AllRecipes via the Apify API and nutritional data from FooDB. I generated ingredient embeddings with Word2Vec, then trained a GraphSAGE graph neural network on the graph structure to learn deeper relational patterns. Dimensionality reduction — PCA, UMAP, and t-SNE — surfaced clusters and outliers in the embedding space, and I fed the results to GPT-3.5 via the OpenAI API to generate novel ingredient pairings and full recipes from the learned relationships.
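The core idea — ingredients that share contexts end up close in embedding space even if they never co-occur — can be sketched without the full Neo4j/Word2Vec/GraphSAGE stack. The toy example below builds co-occurrence vectors from a handful of made-up recipes, compresses them with PCA (via SVD), and ranks never-co-occurring pairs by cosine similarity. The mini recipe list and every name in it are illustrative assumptions, not the project's actual data or code.

```python
from itertools import combinations
import numpy as np

# Toy recipe corpus -- illustrative only, not the AllRecipes data.
recipes = [
    ["broccoli", "cheddar", "cream"],
    ["blueberry", "lemon", "cream"],
    ["broccoli", "lemon", "garlic"],
    ["blueberry", "spinach", "lemon"],
    ["broccoli", "garlic", "cheddar"],
]

vocab = sorted({ing for r in recipes for ing in r})
index = {ing: i for i, ing in enumerate(vocab)}

# Symmetric co-occurrence matrix: C[i, j] = times i and j share a recipe.
C = np.zeros((len(vocab), len(vocab)))
for r in recipes:
    for a, b in combinations(r, 2):
        C[index[a], index[b]] += 1
        C[index[b], index[a]] += 1

# PCA via SVD on the centered matrix: each row becomes a dense embedding.
X = C - C.mean(axis=0)
U, S, _ = np.linalg.svd(X, full_matrices=False)
emb = U[:, :3] * S[:3]  # keep the top 3 components

def similarity(a: str, b: str) -> float:
    """Cosine similarity between two ingredient embeddings."""
    va, vb = emb[index[a]], emb[index[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))

# Rank pairs that never co-occur: candidates for "novel" pairings.
novel = [
    (a, b, similarity(a, b))
    for a, b in combinations(vocab, 2)
    if C[index[a], index[b]] == 0
]
novel.sort(key=lambda t: -t[2])
print(novel[0])  # the most "relationally close" unseen pair in this toy data
```

The real pipeline replaces the co-occurrence/PCA step with Word2Vec and GraphSAGE, which learn far richer structure, but the ranking logic — score unseen pairs by embedding proximity — is the same shape.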

⊹ Results

  • Generated unconventional but chemically plausible pairings — including broccoli and blueberries — that the model identified as relationally close despite rarely appearing together in recipes
  • Produced novel recipes from scratch, including Broccoli and Ham Stuffed Eggplant with a Blueberry and Jalapeño Glaze — which is either inspired or unhinged, possibly both
  • Identified a measurable Western bias in the RecipeNLG dataset, flagging a broader issue with diversity in culinary training data
  • Demonstrated a viable end-to-end pipeline from raw recipe data to generative AI output via a graph neural network
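The last hop in that end-to-end pipeline — handing graph-derived pairings to GPT-3.5 — is largely prompt construction. A minimal sketch of what that hand-off might look like; the pairing list, similarity score, and prompt wording are all invented for illustration, and the actual OpenAI API call is deliberately elided:

```python
def pairing_prompt(pairings: list[tuple[str, str, float]]) -> str:
    """Format graph-discovered ingredient pairings into a recipe-generation
    prompt. The wording here is illustrative, not the project's actual prompt."""
    lines = [
        f"- {a} + {b} (embedding similarity {score:.2f})"
        for a, b, score in pairings
    ]
    return (
        "The following ingredient pairs rarely appear together in recipes, "
        "but a graph model rates them as compatible:\n"
        + "\n".join(lines)
        + "\nWrite one complete recipe that uses at least one of these pairs."
    )

# Hypothetical input echoing the broccoli/blueberry result; the score is made up.
prompt = pairing_prompt([("broccoli", "blueberries", 0.81)])
print(prompt)
# This string would then be sent as a chat message to GPT-3.5 via the
# OpenAI API; the call itself is omitted here.
```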

⊹ Reflection

I finished this project in 2023, when large language models were just beginning to go mainstream — and looking back, the timing was both exciting and limiting. The pipeline holds up. What I'd do differently now: ground the ingredient relationships in actual chemical compound data rather than co-occurrence patterns, and integrate allergen information so the output is genuinely useful rather than just novel. I'd also push the model comparisons further — I compared dimensionality reduction techniques but not the generative models themselves, and that's a gap a stronger paper would have closed. Three years on, the infrastructure to do this properly is dramatically better. This one isn't finished — it's just waiting for a second chapter.