5 Real-World AI Projects To Get Your Foot in the Door


Don’t just learn data science — do it! The best way to do data science is to build real-world projects that spark your passion and resonate with you.

No matter where you are in your Data science journey, you can always roll up your sleeves, get your hands dirty and experiment with things. This helps you connect the dots and challenge your understanding.


If you are new to the world of AI and LLMs and want to get your foot in the door, I think the following real-world projects (in order of complexity) are good gateways into the field. Even though prompt engineering is an important aspect of working with (generative) AI models, we will skip it in this article.

Here is the agenda for today

· What to look for in AI Projects?
· Project 1: Build a RAG chatbot to ask anything about books! 📚
· Project 2: Build an autonomous agent: everything-about-books 📚
· Project 3: Train your own LLM (a song writer 🎶🎸🎹)
· Project 4: Fine-tune a Bert model to understand legal texts 👩‍⚖️
· Project 5: Model Evaluation

What to look for in AI Projects?

Ideas about “artificial intelligence” appeared as early as the 1800s, although mentions were rare.


Some ideas surrounding artificial intelligence already existed in the 19th century. In 1872, Samuel Butler published Erewhon, which contains a story about a fictional land where machines evolved according to Darwin’s theory of evolution, but at a much faster pace, gained consciousness and surpassed humans in every respect. Pure fiction 150 years ago, but today it’s not entirely unimaginable.


The 1940s–1960s were a golden era for AI discovery. Even though the landscape has changed very quickly in the last decade with huge amounts of data and computing power, artificial intelligence has been around for quite a while.


The term Artificial Intelligence as we use it today was officially coined at the Dartmouth AI Workshop in 1956. These days, when people talk about AI, they often mean Generative AI, which is a subset of Machine Learning and Deep Learning.

When exploring AI projects, in my opinion, we would want to prioritise those that offer:

  • Theoretical fundamentals and AI Concepts: Grasp the fundamental theories, principles and core concepts in the field of AI.
  • Development of AI products: Get hands-on experience by applying frameworks and building practical applications. This helps to validate your understanding and improve your technical skills.
  • Evaluation: Learn how to assess and refine the performance of your AI applications.

Project 1: Build a RAG system to ask anything about books! 📚

Imagine you have a whole database of books (📚) and you want to retrieve the relevant books for a given question and then answer questions about them. This is a perfect use case for a document retrieval app using RAG.

» What will you create?

We will create a RAG system that, given a user query, returns the relevant books from our database and answers any questions about them! 📚📚📚

» Skills you will learn

  • RAG system
  • Create vector embeddings
  • Store and query embeddings using a vector store/database (e.g., FAISS, Qdrant, Chroma)
  • Combine vector stores and an LLM for information retrieval

» Fundamental theories and concepts

👉 What is a Retrieval Augmented Generation (RAG) system?

A RAG-based architecture provides an LLM (e.g., Claude 3.5) with access to external sources of knowledge that add context to the user query. This typically involves searching by similarity to the query, retrieving the most relevant documents, and inserting them into the prompt as context for information retrieval.

RAG helps reduce hallucinations in open-ended scenarios, such as a user talking to a chatbot that is prone to making things up when asked about something not in its training data.


Here’s how the RAG process works (see the sketch after this list):

  • Break the documents into chunks
  • Turn each chunk into a vector embedding and index the chunks in a vector database
  • Query: given user input, vectorize it, search the vector database for the closest records and retrieve the relevant context
  • Generate: combine the query and the relevant context, and get the LLM response
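To make these steps concrete, here is a minimal, framework-free sketch of the query path. The embed() and llm() callables are placeholders for whatever embedding model and LLM you use, and the search is a brute-force cosine similarity rather than a real vector database:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(query, chunks, embed, llm, k=3):
    # 1) Index: embed every chunk once (in practice this lives in a vector database)
    chunk_vectors = [embed(c) for c in chunks]
    # 2) Query: embed the user question and find the k closest chunks
    q_vec = embed(query)
    top_k = sorted(range(len(chunks)), key=lambda i: cosine(q_vec, chunk_vectors[i]), reverse=True)[:k]
    context = "\n\n".join(chunks[i] for i in top_k)
    # 3) Generate: combine the retrieved context and the question in one prompt
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)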

👉 Embeddings and vector stores/databases

Embedding models were available long before the advent of generative AI, but generative AI models have renewed interest in vector representations of text, or word embeddings, which is a fancy way of saying that text (or images) can be represented as a list of numbers. You can think of these numbers as something like coordinates for a location.

You can compute Paris - France + Netherlands, and the result is a vector embedding close to Amsterdam, which seems to show that the notion of a capital city was encoded in the embeddings.

Here is another famous example: if you compute King - Man + Woman (adding and subtracting the embedding vectors of these words), the result will be very close to the embedding of the word Queen. It seems like the embeddings encode the concept of gender!
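If you want to try this arithmetic yourself, here is a small sketch using gensim’s pre-trained GloVe word vectors (the glove-wiki-gigaword-100 model from gensim.downloader is an illustrative choice, not something used in the projects below):

import gensim.downloader as api

# Downloads the pre-trained GloVe word vectors on first run (~130 MB)
vectors = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ queen
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# paris - france + netherlands ≈ amsterdam
print(vectors.most_similar(positive=["paris", "netherlands"], negative=["france"], topn=1))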

When you ask ChatGPT a question, under the hood your question is converted into an embedding/vector that the model can work with. In a RAG setup, the documents’ embeddings are indexed and stored in a vector database, which stores the text records with their vector representation as the key. This technology helps reduce hallucinations by adding relevant context that the model wasn’t trained on to the prompt, so that it can use this context when generating the response.


» Implementation Steps

Tech stack:

  • LLM framework: LangChain, which provides many components for working with LLMs
  • Foundation model: GPT-4o
  • Vector storage: Qdrant (you can also use Chroma or FAISS)
  • Front-end: Holoviz Panel (an alternative could be Streamlit)
  • Embedding model: OpenAI text-embedding-3-large

👉 Step 1: Set up the Environment

First, ensure you have the necessary libraries installed:

uv pip install --upgrade langchain openai qdrant-client pandas nltk tomotopy pyvis

👉 Step 2: Scrape book data

Function details are omitted for brevity; please see the repo:

def scrape_book():
    """
    Scrape book metadata from the Google Books API and book reviews
    from Amazon using Selenium.
    (Function implementation details omitted for brevity.)
    """
    return df_books

👉 Step 3: Set up the vector database

First we need to create the embeddings object and set up a vector database to store the embeddings of our book data. I will be using OpenAI text-embedding-3-large for generating embeddings.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

def create_db(documents):
    return Qdrant.from_documents(
        documents=documents,
        embedding=embeddings,
        collection_name="my_documents",
        location=":memory:",
        force_recreate=False,
    )

db = create_db(documents)

When setting up the vector database, we pass location=":memory:" to specify that the database should be created in memory, since we plan to interact with it in the same session.

👉 Step 4: Information retrieval using relevant context

Next, we take a user query, search the database and return a list of relevant documents. There are some parameters you can tweak here, for example the search space (k, the number of documents to return) or the search type (similarity_score_threshold, or maximum marginal relevance, mmr):

from langchain.chains import RetrievalQA

retriever = db.as_retriever(
    search_type="mmr", search_kwargs={"k": 2, "lambda_mult": 0.25}
)

# Create a chain to answer questions (llm is a chat model, e.g. the GPT-4o model from our tech stack)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)

query = "Can you tell me about the key theme for the book Life 3.0 in 20 words?"
result = qa({"query": query})

And there it is! A system that searches by similarity to the query, retrieves the most relevant documents and uses them as context to answer the user’s question.


» Useful Resources

Project 2: Build an autonomous agent: everything-about-books 📚

Generative AI models have given rise to agent-based architectures. If you want to understand how agents work and build one from scratch, I have an article on that.

» What will you create?

Before foundation models, only organizations with the resources to develop AI models could build AI applications. With foundation models, anyone can build AI applications.

In this project, we will create an enhanced version of the RAG system in Project 1 that can autonomously decide and take actions without any human intervention. Exciting!

» Skills you will learn

  • Agent architectures
  • Build a custom agent with OpenAI function calling & Langchain LCEL
  • Creating interactive user interfaces with Holoviz Panel

» Fundamental theories and concepts

👉 What is an agent? 🤖

An agent is an autonomous entity that, given high-level instructions, can plan, use actions/tools, and perform multiple iterative steps to achieve a desired goal.

Agents can take various actions, such as executing a Python function. The agent then observes what happens as a result of the action and decides which action to take next. This process repeats until the agent has the final answer to the main task.

You can also see this process written out in the following pseudocode:

next_action = agent.get_action(…)
while next_action != AgentFinish:
    observation = run(next_action)
    next_action = agent.get_action(…, next_action, observation)
return next_action

An agent has components such as inputs, desired goals and available actions. Consider a self-driving car: it receives inputs such as sensor data (cameras or ultrasonic sensors). The goal is safe, efficient navigation. The reward function could be miles driven without intervention (Tesla). The available actions include accelerating, decelerating, turning, changing lanes and stopping.

There are many agent frameworks that aim to improve LLM responses. The original framework was ReAct, which lets an LLM create observations after taking actions via tools. These observations are then turned into thoughts about the right tool to use in the next step, until a final answer is reached.

OpenAI has released LLMs fine-tuned for function calling, which offer an alternative to the standard ReAct pattern for tool use.

» Implementation Steps

For this project, we will be using OpenAI function calling and Langchain LCEL to build the Agent.

An agent works with the tools/actions available to it, so the first step is to define the tools.

👉 Step 1: Define Tools

A tool is simply a predefined function that allows the agent to take a specific action.

Since LLMs such as GPT-4 normally only generate text or images, we can provide tools that perform other actions, such as interacting with a database or executing Python code.

We will start by defining four main tools that our agent will use. For brevity, function implementation details are omitted here:

  1. scrape_books: Scrapes books and book reviews from Google Books and Amazon.
  2. find_relevant_books: Retrieves relevant books based on a user query.
  3. create_topic_network: Creates a visualization of the topics in the books.
  4. qa: Answers the user’s questions based on the retrieved documents.

These tools are defined as functions and decorated with the @tool decorator from LangChain, for example:

from langchain.tools import tool

@tool
def find_relevant_books(user_query):
    """
    Return all relevant books based on the user query.
    Important: This function should be called only for queries that require finding specific books.
    For general queries that do not require finding specific books, use other available functions.
    """
    retriever = db.as_retriever(
        search_type="mmr", search_kwargs={"k": 4, "lambda_mult": 0.25}
    )
    relevant_docs = retriever.get_relevant_documents(user_query)
    session_state["relevant_docs"] = relevant_docs
    session_state["retriever"] = retriever
    return relevant_docs

import os
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    openai_api_key=os.getenv("OPEN_AI_KEY"),
)
from langchain.chains import ConversationalRetrievalChain

@tool
def qa(user_query):
    """
    Answer user questions based on the retrieved documents.
    """
    retriever = session_state["retriever"]
    relevant_docs = session_state.get("relevant_docs")
    if relevant_docs is None:
        # If no documents are stored, retrieve them
        relevant_docs = retriever.get_relevant_documents(user_query)
        session_state["relevant_docs"] = relevant_docs

    # Create a chain to answer questions using the stored documents
    qa_chain = ConversationalRetrievalChain.from_llm(llm, retriever)
    chat_history = []
    result = qa_chain(
        {"question": user_query, "chat_history": chat_history, "context": relevant_docs}
    )
    return result

When decorating these actions with @tool, the main agent has access to the list of functions, their arguments and docstrings. This enables the agent to smartly choose the most relevant tool for the task.

For convenience, we will store the relevant documents and the retriever in a globally defined dictionary session_state (a minimal version is shown below). This makes it easier for the agent to access this information.
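A minimal version of that shared dictionary could simply be a module-level dict (an assumption on my part; any per-session storage would work):

# Simple in-memory state shared by all tools.
# A real app might scope this per user session instead of keeping it global.
session_state = {"retriever": None, "relevant_docs": None}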

👉 Step 2. Create the prompt

Now we set up the prompt with a system message, a user message, and a MessagesPlaceholder that allows the agent to store its intermediate steps:

from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder

# Define the prompt template
prompt_template = """
You are a helpful AI assistant specializing in answering questions
related to books from users. Use retrieved relevant books to
answer questions.
====================
{relevant_docs}
"""

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are a helpful AI assistant. Use the following
            template for your actions and observations.""",
        ),
        ("user", prompt_template),
        MessagesPlaceholder(variable_name="chat_history"),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

The scratchpad is where the agent will store all the intermediate results. For example, if the user asks to create a visualization of all the topics for the first Harry Potter book, the agent will first find the relevant book (the philosopher’s stone), store the output in the scratchpad, then reason that it should call create_topic_network next.

👉 Step 3. Initialize the agent

For the agent to know all the available tools, you will need to first bind the tools directly to the LLM for function calling:

import openai
from langchain.agents.format_scratchpad import format_to_openai_functions
from langchain.tools.render import format_tool_to_openai_function
from langchain.chat_models import ChatOpenAI

# These are the custom tools for finding books, answering questions, and creating topic networks.
tools = [find_relevant_books, qa, create_topic_network]

# OpenAI function formatting: converts the tools into a format compatible with OpenAI's function calling feature.
functions = [format_tool_to_openai_function(f) for f in tools]

# Set up the GPT-4o model and bind the defined functions to it.
model = ChatOpenAI(
    openai_api_key=openai.api_key,
    temperature=0,
    model_name="gpt-4o",
).bind(functions=functions)

Now that we have our tools and prompt defined, we can create the agent:

from langchain.agents import AgentExecutor
from langchain.agents.output_parsers import OpenAIFunctionsAgentOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.memory import ConversationBufferMemory

# Set up the agent chain:
# assign the relevant documents and agent scratchpad, apply the prompt, run the model, and parse the output.
agent_chain = (
    RunnablePassthrough.assign(
        agent_scratchpad=lambda x: format_to_openai_functions(x["intermediate_steps"]),
        relevant_docs=lambda x: "\n".join(
            str(doc) for doc in session_state.get("relevant_docs", [])
        ),
    )
    | prompt
    | model
    | OpenAIFunctionsAgentOutputParser()
)

# Set up a memory component to store the conversation history.
memory = ConversationBufferMemory(
    return_messages=True,
    memory_key="chat_history",
    input_key="input",
    output_key="output",
)

# Combine all components into an executable agent that can process queries and maintain conversation context.
# With AgentExecutor, the agent is equipped with the tools, and verbose output enables detailed logging.
agent = AgentExecutor(agent=agent_chain, tools=tools, verbose=True, memory=memory)

And there it is! A fully functional agent with access to a few tools, ready to get to work.

👉 Step 4. Creating the User Interface with Panel

Now that we have our agent set up, let’s create a user-friendly interface using Panel to interact with this agent:

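The Panel wiring itself only takes a few lines. Below is a minimal sketch using pn.chat.ChatInterface (available in recent Panel releases); the agent is the AgentExecutor from Step 3, and the exact widget options are illustrative choices:

import panel as pn

pn.extension()

def callback(contents: str, user: str, instance: pn.chat.ChatInterface):
    # Run the agent on the user's message and return its final answer
    response = agent.invoke({"input": contents})
    return response["output"]

chat = pn.chat.ChatInterface(callback=callback, user="You")
chat.send("Ask me anything about books!", user="Assistant", respond=False)
chat.servable()  # run with: panel serve app.py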

» Useful Resources

Project 3: Train your own LLM (a song writer 🎶🎸🎹)

If you are interested in the theoretical fundamentals of AI and want a high-level understanding of how these foundation models are trained, building an LLM from scratch will challenge your understanding.

If you are new to transformer-based language models, you are in luck, because it is super simple with nanoGPT. In the video Let’s build GPT: from scratch, Andrej Karpathy walks through constructing a baby GPT model, or nanoGPT, from the ground up and explains what is going on under the hood and what is at the core of ChatGPT. The code to build a babyGPT model on Shakespeare’s text is provided in this repository.

» What will you create?

Do you love music? Why not build an LLM that can generate songs in the style you want?

Because I love Ed Sheeran, in this project we will create a small word-based transformer model that writes songs in Ed Sheeran’s style! 🎶🎸🎹

» Skills you will learn

  • What it means to “train” a language model from scratch with PyTorch
  • Basics of neural networks: forward and backward propagation, activation functions, the gradient descent algorithm, and how weights are updated
  • Some important NLP concepts such as tokenization
  • Important hyper-parameters: n_layer, n_head, n_embd, learning_rate, max_iters, lr_decay_iters

» Fundamental theories and concepts

Compared to the rest of the article, this section is math-heavy. If you find it confusing, feel free to skip the math.

👉 Basics of neural network

An artificial neural network takes input signals and produces an output signal; simply put, it activates the output when its inputs are activated.


Each input in a neural network is associated with a weight. First, the neural network takes the weighted sum of all of the input values.

Forward propagation
In the hidden layers, each neuron applies an activation function to the weighted sum of its inputs and produces an output, which serves as input for the next layer.

An activation function helps the neural network learn patterns in the data and passes the output of one layer on as input to the next hidden layer. The process continues until we get the output of the final layer, which is the predicted value.

Back-propagation process

Now we have an output, and the network starts the back-propagation process. It is all about the so-called loss function. In essence, a loss function compares the predicted output with the actual output of the network and returns the error information (the difference between y and ŷ).


For each training instance, back-propagation measures how each weight in the network contributes to the overall error. This allows the model to update the weights using an optimization algorithm, which tweaks all the weights in the network until the loss function is minimized.

Among optimization algorithms, gradient-descent-based methods are the most widely used. To understand exactly how the weights are adjusted using Gradient Descent, a detailed explanation can be found here. You can also gain some insights into alternatives to Gradient Descent in this post.
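To connect this vocabulary (forward pass, loss, back-propagation, weight update) to actual code, here is a minimal PyTorch training loop for a toy two-layer network on random data:

import torch
import torch.nn as nn

# Toy data: 64 samples, 10 features, 1 target value
X, y = torch.randn(64, 10), torch.randn(64, 1)

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # gradient descent

for epoch in range(100):
    y_hat = model(X)              # forward pass
    loss = loss_fn(y_hat, y)      # compare prediction with the actual output
    optimizer.zero_grad()
    loss.backward()               # back-propagation: compute gradients
    optimizer.step()              # update the weights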

Back in the day, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) were the popular neural network architectures for deep learning with images (CNNs) and text (RNNs). However, in 2017 the landmark paper “Attention Is All You Need”, which introduced the transformer architecture, changed the world of AI forever: it is the architecture behind today’s LLMs, including ChatGPT.


👉 Tokenization

Tokens are the building blocks of language models. Tokenization is a way of separating a piece of text into smaller chunks called tokens, so you can think of tokens as pieces of words. For OpenAI GPT models, an average token is roughly ¾ the length of a word, so 1,000 tokens is about 750 words. Depending on the tokenizer you use, tokens can be words, characters, or subwords.


The tiktoken library is a common library for tokenizing text, particularly when working with models like OpenAI’s GPT-3 or GPT-4. Below is an example of how to use tiktoken to turn text into tokens:

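A minimal sketch might look like this (cl100k_base is the encoding used by GPT-3.5/GPT-4 era models; the nanoGPT project below uses the gpt2 encoding instead):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/GPT-4 era models

tokens = enc.encode("Tokenization is fun!")
print(tokens)                              # a list of integer token ids
print([enc.decode([t]) for t in tokens])   # the text piece each token id maps to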

» Implementation Steps

Alright, enough talking, let’s get our hands dirty. We are training a small word-based transformer model that predicts which term will come next.

👉 Step 1. Prepare the training data

Load dataset

For this article, we will use the ed-sheeran dataset, which contains the lyrics to all the songs by Ed Sheeran. You can load this dataset with the datasets library:

from datasets import load_dataset
dataset = load_dataset("huggingartists/ed-sheeran")

Awesome! We are now ready to do some data processing to get the lyrics from each song in the dataset. The following block of code processes the data into a CSV file:

import pandas as pd

df = pd.DataFrame(data=dataset)
df["text"] = df.train.apply(lambda row: row.get("text"))

def get_title_lyrics(text):
    lyrics_start = "Lyrics"
    lyrics_index = text.index(lyrics_start)
    title = text[:lyrics_index].strip()
    lyrics = text[lyrics_index + len(lyrics_start):].strip()
    return {"Title": title, "Lyrics": lyrics}

df[["Title", "Lyrics"]] = df["text"].apply(get_title_lyrics).apply(pd.Series)


Encode the text and create training and validation sets

Since language models work with tokens, we will convert the raw lyrics into a sequence of integers, or token ids. Because we are going to train a word-level transformer model, we will encode each token, represented by a unique token id (an integer), using the GPT-2 tokenizer.

Let’s select 90% of the text as training data and 10% for validation.

The encoded text is split into a training set (train_ids) and a validation set (val_ids). These training and validation sets contain sequences of integers that correspond to the tokens in the original text:

import os
import tiktoken
import numpy as np
import pandas as pd

df = pd.read_csv("data/ed-sheeran/ed_sheeran.csv")
data = df["Lyrics"].str.cat(sep="\n")
n = len(data)
train_data = data[: int(n * 0.9)]
val_data = data[int(n * 0.9) :]

# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)

# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(os.path.dirname(__file__), "train.bin"))
val_ids.tofile(os.path.join(os.path.dirname(__file__), "val.bin"))
# train has 433,585 tokens
# val has 48,662 tokens

Now, I will save the above code in a file called prepare-edsheeran.py and run the following command:

python data/prepare-edsheeran.py

This saves the train_ids and val_ids sequences as binary files, train.bin and val.bin, which hold the GPT-2 token ids in one sequence. And that’s it! The data is ready, and we can kick off the training.

👉 Step 2. Define the model

Code implementation details omitted for brevity. The following process encapsulates the essential steps for creating the model and training (code can be seen in this repository).

Create model.py with GPT class definition:

  • Initialize transformer components (embeddings, blocks etc)
  • Define forward pass: process input through embeddings and transformer blocks
  • Configure optimizer: separate parameters for weight decay
  • For each epoch and batch, perform forward pass, calculate loss and back-propagate and update parameters

Then, we will create train.py to initialize model, run training loop and generate texts.

👉 Step 3. Train the babyGPT model

In this section, we will actually train a baby GPT model. Let’s create a new file called config/train_edsheeran.py to define the hyper-parameters:

out_dir = "out-lyrics"
eval_interval = 250  # keep frequent because we'll overfit
eval_iters = 20
log_interval = 10  # don't print too often
# we expect to overfit on this small dataset, so only save when val improves
always_save_checkpoint = False
dataset = "ed-sheeran"
batch_size = 12  # 12 samples per iteration
block_size = 64  # context size
# a baby GPT model :)
n_layer = 6
n_head = 6
n_embd = 384  # each embedding vector for each token will have 384 dimensions
dropout = 0.2
learning_rate = 1e-3  # with baby networks we can afford to go a bit higher
max_iters = 2000
lr_decay_iters = 2000  # usually make this equal to max_iters
min_lr = 1e-4  # usually learning_rate / 10
beta2 = 0.99  # make a bit bigger because the number of tokens per iter is small
warmup_iters = 100  # not super necessary potentially

To train the model, in your terminal, run the following code:

python train.py config/train_edsheeran.py

and training starts…!

*Waiting…*

Voila! Training is done. We will create a plot displaying the loss on the validation set as a function of the number of iterations. Observing the following plot, we notice an increase in the validation loss after 500 iterations, suggesting the presence of overfitting.

[Figure: validation loss vs. number of iterations]

To address this issue, we will limit training to these 500 iterations and retrain the model. Once the retraining finishes, the trained model ckpt.pt will be saved to the output directory out-lyrics.

👉 Step 4. Generate songs in Ed Sheeran’s style

Now is the fun part! Let’s see how well our model can learn to craft songs that sound like Ed Sheeran! We can sample from the best model by pointing the sampling script at this directory:

python sample.py --out_dir=out-lyrics

Running the above code generates a few samples. Here is the result:

[Image: sample lyrics generated by the model]

I think it does sound like Ed Sheeran, with cheesy love songs and romantic themes, doesn’t it? 🎶🎸🎹

» Resources

Project 4: Fine-tune a Bert model to understand legal texts 👩‍⚖️

Wouldn’t it be awesome if you could use state-of-the-art models for your own specific task without having to train one from scratch? Fine-tuning is an incredibly powerful training technique for this!

» What will you create?

We will create a specialized domain Bert-based model for a semantic role-labelling task using legal texts! 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks.

» Skills you will learn

  • Fine-tune a pretrained model with the 🤗 Transformers Trainer framework
  • Work with the Dataset object from the 🤗 Datasets library

» Fundamental theories

👉 What is Finetuning?

Finetuning a model means continuing to train a previously trained model using a dataset specific to your task; the model weights come from that previous training process. For example, if you fed your childhood journal entries into ChatGPT and continued training it, that would be finetuning.

» Implementation Steps

(Code adapted from Hugging Face)

👉 Step 1. Load a dataset object and split into train/test/validation:

Obviously, this requires having a labelled dataset.

Load the dataset for finetuning

import pandas as pd
from datasets import Dataset, DatasetDict

data = "data/all_annotations_cleaned.csv"
df_to_train = pd.read_csv(data, sep=";", converters={"tokens": eval, "srl_tags": eval})
dataset = Dataset.from_pandas(df_to_train)

# Split the main dataset into train, validation, test as a DatasetDict
train_testvalid = dataset.train_test_split(test_size=0.1)
# Split the 10% test + valid in half test, half valid
test_valid = train_testvalid["test"].train_test_split(test_size=0.5)
# Collect the two into a single DatasetDict
datasets = DatasetDict({
    "train": train_testvalid["train"],
    "test": test_valid["test"],
    "validation": test_valid["train"],
})


👉 Step 2. Tokenization

To tokenize our dataset in one step, we will use the robbert-v2-dutch-base tokenizer (because I am using Dutch legal text to finetune a Dutch Bert-based model). The [datasets.map](https://huggingface.co/docs/datasets/process#map) method will apply the tokenization over the entire dataset:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("pdelobelle/robbert-v2-dutch-base", add_prefix_space=True)
def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    # (label alignment omitted for brevity)
    return tokenized_inputs
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)

After tokenizing the dataset, we now also get the input_ids and attention_mask.


👉 Step 3. Finetuning with Huggingface Trainer

Load Trainer

Transformers provides a Trainer class optimized for training Huggingface Transformers models. We will start by loading the chosen model. I will be using the Dutch Bert model:

from transformers import AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("GroNLP/bert-base-dutch-cased", num_labels=len(label_list))

Create training hyperparameters

Next, create a TrainingArguments class which contains the hyperparameters you can tune:

from transformers import TrainingArguments

batch_size = 1
args = TrainingArguments(
    output_dir=".",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    seed=1,
)

Define the evaluation metrics

The datasets package also provides methods for producing accuracy metrics:

import numpy as np
from datasets import load_metric

metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Finetune the model

Create a Trainer object with the chosen model, training arguments, training and evaluation datasets, and evaluation metrics:

from transformers import Trainer, DataCollatorForTokenClassification

# The data collator batches and pads the tokenized examples
data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Then, we can simply fine-tune the model by calling the train() method:

trainer.train()

The model is done training and can be used for the Semantic Role Labelling task. Let’s check whether the performance is better than the pre-trained RobBERT model.


Well, it seems the improvement is not that significant 😃 But at least we learnt to fine-tune a Bert model!

» Resources

🤗 Finetune a pretrained model

Project 5: Model Evaluation

Evaluating the output of GenAI models is as crucial as it is challenging.

Back in the day, before GenAI, you simply split your data into training/validation/test sets, trained your model on the training set and evaluated its performance on the validation and test sets. In supervised learning, we use R-squared, precision, recall, or F-score to evaluate performance. How is a Large Language Model evaluated? What is the ground truth when it comes to generating new text?

» What will you create?

We will apply different approaches to evaluating open-ended responses, including functional correctness, similarity scores, and AI-as-a-judge.

» Skills you will learn

  • Similarity Measurements Against Reference Data
  • Checking the consistency of model output
  • Using LLM as a judge
  • Understanding evaluation metrics for NLP models (e.g., BLEU, ROUGE)

» Fundamental theories

👉 Similarity Measurements Against Reference Data

One common approach is to evaluate AI’s outputs against reference data. Generated responses more similar to the reference responses are better.

There are three ways to measure the similarity between two open-ended texts:

(1) Asking an evaluator to judge whether two texts are the same

Evaluators used to compare two responses can be human or AI. However, if you are already using humans to make this comparison, you might not need reference data — humans can evaluate the generated responses directly.

(2) Lexical similarity

Lexical similarity measures whether two texts look similar, not whether they have the same meaning. In other words, it measures how much two texts overlap. One example of such a metric is the ROUGE score, as in the following example:

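As a sketch, the rouge-score package (installable with pip install rouge-score) computes these overlap metrics; the reference and generated sentences below are made up for illustration:

from rouge_score import rouge_scorer

reference = "The cat sat on the mat."
generated = "A cat was sitting on the mat."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, generated)
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)  # word-overlap F1 scores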

(3) Semantic similarity

Semantic similarity measures how close the generated response is to the reference responses in meaning (semantics). This requires transforming the text into a numerical representation, or embedding, which we covered in Projects 1 and 2.

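Here is a small sketch using the sentence-transformers library; the model name all-MiniLM-L6-v2 is just a commonly used small embedding model, and the two sentences are made up for illustration:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "The book argues that superintelligent AI could reshape society."
generated = "According to the book, society may be transformed by superintelligent AI."

emb_ref, emb_gen = model.encode([reference, generated], convert_to_tensor=True)
print(util.cos_sim(emb_ref, emb_gen).item())  # cosine similarity; close to 1.0 means similar meaning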

👉 Checking consistency of model output

One big problem with LLMs is reproducibility. Chat Completions are non-deterministic by default, even at temperature = 0, which means model outputs may differ from request to request.

To evaluate the consistency of the model’s responses, we can repeatedly call the model with the same question and prompt, using different seeds each time. By analyzing how the answers are distributed across these runs, we can determine the model’s consistency. If the distribution of responses is narrow, it indicates that the model produces consistent outputs.
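A minimal sketch with the OpenAI Python client could look like this (the model name, question, and number of runs are illustrative; the seed parameter is supported on recent Chat Completions models):

from collections import Counter
from openai import OpenAI

client = OpenAI()
question = "In which year was the book Life 3.0 published? Answer with the year only."

answers = []
for seed in range(10):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        temperature=0,
        seed=seed,
    )
    answers.append(response.choices[0].message.content.strip())

print(Counter(answers))  # a narrow distribution indicates consistent outputs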


👉 Using LLM as a judge

As AI has successfully been used to automate many challenging tasks, can it automate evaluation as well? The approach of using AI to evaluate AI is called AI-as-a-judge or LLM-as-a-judge. An AI model that is used to evaluate other AI models is called an AI judge.
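A minimal AI-judge sketch: ask a strong model to grade another model's answer against a rubric. The prompt wording and the 1-5 scale below are illustrative choices, not a standard:

from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> str:
    prompt = (
        "You are a strict evaluator. Rate the answer to the question on a scale of 1-5 "
        "for factual correctness and relevance, then give a one-sentence justification.\n\n"
        f"Question: {question}\nAnswer: {answer}\n\nRating:"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(judge("What is the key theme of Life 3.0?", "It is a cookbook about Italian food."))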


» Implementation Steps

All the code can be found in one of my previous posts.

» Resources

📹 OpenAI Cookbook — Example for evaluation

📚 AI Engineering (Chip Huyen)

Conclusion

So there you have it — five exciting projects to kickstart your journey into generative AI. I hope you found some ideas for your next AI projects.

We are still at the very early days of GenAI and we don’t know how things will turn out. Your next idea could be the one that changes the game. So keep experimenting, keep learning, and most importantly, keep having fun with it.

Any feedback or recommendation is greatly appreciated. Happy learning 📚😊!

Thanks for reading!

If you are keen on reading more of my writing but can’t choose which one, no worries, I have picked one for you:

GenAI’s products: Move fast and fail
Building a cool and fancy demo is easy, building a final product is not. (pub.towardsai.net)

Do You Understand Me? Human and Machine Intelligence
Can we ever understand human intelligence and make computers intelligent in the same way? (pub.towardsai.net)
