← Catalogue Modern Skills 250 level Created by AI

Modern Skills

Run AI on Your Own Computer: A Plain-English Guide to Local Models

Name: Run AI on Your Own Computer: A Plain-English Guide to Local Models
Availability: InStock
Author: Sikh Archive

Professor: Sikh Archive Source: Sikh Archive

Begin course 12 lessons · 8-question test · 80% to pass

Created by AI. Drafted with AI and reviewed for accuracy. Spotted an error? Tell us.

Prerequisite recommended.

What you'll learn

Explain in simple words what an open-weight AI model is and how it differs from a closed cloud model.
Describe what tools that run models on your own machine actually do, in everyday language.
Read a model's size and quantization label and guess whether it will fit on your computer.
Estimate the RAM and hardware you need to run a small, medium, or large model.
Weigh the privacy, cost, and offline benefits of local AI against its trade-offs versus the cloud.
Take the first practical steps to install a runner, download a model, and chat with it.

Key terms — ਸ਼ਬਦਾਵਲੀ

Open-weight model

An AI model whose learned numbers (its 'weights') are made public so anyone can download and run it themselves.

Weights

The huge list of numbers a model learned during training. They are the model. Running a model means doing math with these numbers.

Local AI

Running the model on your own laptop or desktop instead of sending your words to a company's servers over the internet.

Runner / inference tool

A program that loads a model file and lets you chat with it. Examples include desktop apps and command-line helpers.

Quantization

Shrinking a model by storing its numbers with less detail (like rounding). It makes the file smaller and faster, with a small quality cost.

Parameters

The count of weights in a model, usually written like 7B (7 billion). More parameters often means smarter but heavier.

RAM / VRAM

The fast memory your model has to fit into. RAM is general computer memory; VRAM is the memory on a graphics card (GPU).

Token

A small chunk of text (roughly a word piece) that the model reads and writes. Speed is often measured in tokens per second.

Lessons

1. Why Run AI on Your Own Machine?

Course Contents

Why Run AI on Your Own Machine?
What 'Open Weights' Really Means
The Tools That Run Models for You
Model Sizes and Quantization Made Simple
Hardware: Will It Fit on My Computer?
Trade-offs and Your First Steps

When you use a popular AI chatbot online, your words travel over the internet to a company's computers. The model 'thinks' there and sends an answer back. That is cloud AI. It is easy and powerful, but you are renting someone else's computer and trusting them with what you type.

Local AI flips this around. You download a model file onto your own laptop or desktop, and the thinking happens right there. Nothing leaves your machine. No internet needed once it is set up.

People choose local AI for a few plain reasons:

Privacy. Your notes, journals, or work documents never leave your computer.
Cost. After the free download, there is no monthly bill and no per-question charge.
Offline. It works on a plane, in a village with no signal, or when the internet is down.
Control. The model will not change or disappear on you. It is yours to keep.

The catch is that a model running on your own computer is usually smaller and a bit less capable than the giant ones in the cloud. The rest of this course explains the words you need, the tools that make it easy, and how to start.

References: Mozilla AI guide to running open-source LLMs locally; MIT Technology Review on open-weight models.

Homework

Reflect on one situation in your personal or professional life where data privacy matters to you. Write 300–400 words describing what information you would not want sent to a cloud server and why. Then consider how running AI locally might address that concern. This is not a technical exercise — focus on your values and what privacy means in your context.

Your notes — saved on this device

2. What 'Open Weights' Really Means

Every AI model is, at heart, a giant pile of numbers called weights. During training, the model adjusts these numbers until it gets good at predicting text. Once training is done, those numbers are the model. To use the model, your computer just does math with them.

A closed model keeps its weights secret. You can only reach it through the company's website or app. You never get the numbers themselves.

An open-weight model is the opposite: the company publishes the weights so anyone can download the file and run it on their own machine. This is what makes local AI possible. Note that 'open weights' is not always the same as fully 'open source' (which would also share the training data and code), but for running a model at home, having the weights is what counts.

Why does this matter to you?

You can run the model with no permission and no account.
You can use it privately and offline.
Many open-weight models are free for personal use; always glance at the licence.

Think of a closed model as a meal you order at a restaurant, and an open-weight model as a recipe you take home and cook yourself. Both feed you, but only one lets you keep the recipe.

References: Hugging Face model hub overview; MIT Technology Review on open-weight AI.

Homework

Spend time browsing the Ollama model library at ollama.com/library or the Hugging Face model hub. Find one open-weights model you did not know existed before this lesson. Write a 300-word profile of that model: who released it, what it is designed for, what its license allows and restricts, and whether you would trust it for your own use. Reflect on what 'open' means to you after reading the license.

Your notes — saved on this device

3. The Tools That Run Models for You

You do not need to be a programmer to run a model. A runner (also called an inference tool) is a program that loads the model file and gives you a chat box. There are three common styles:

Simple desktop apps. You install one program, click to download a model from a built-in list, and start chatting in a window. This is the easiest path for most people.
Command-line helpers. You type a short command and the tool downloads and runs the model in your terminal. Great for tinkerers and for connecting models to other software.
Lightweight engines. The core technology many tools are built on. It is highly efficient and can run models even on ordinary computers without a fancy graphics card.

All of these do the same basic job: take your message, feed it through the model's weights, and stream back an answer one piece at a time. Many also offer a small built-in 'server' so other apps on your computer can talk to the model.

You do not need to pick the 'perfect' tool. Start with a simple desktop app, see if you like local AI, and explore the others later. They all run the same kinds of open-weight model files.

References: Ollama documentation; LM Studio documentation.

Homework

Install one of the three tools discussed in this lesson — Ollama, LM Studio, or Jan — on your own computer. If you cannot install software, research one tool in depth by reading its official documentation. Write 300–400 words describing the installation or research experience: what was easy, what was confusing, and what one question arose that you would want answered before using it seriously.

Your notes — saved on this device

4. Model Sizes and Quantization Made Simple

Model names often include a number like 3B, 7B, 13B, or 70B. The 'B' means billions of parameters (weights). More parameters usually means a smarter model, but also a bigger file and more memory needed. A 7B model is a comfortable middle ground for many home computers.

You will also see labels like Q4, Q5, or Q8. This is quantization. The model's numbers are normally very precise, which makes the file large. Quantization rounds them off to save space, like writing 3.14 instead of 3.14159265. A lower number (Q4) means smaller and faster, with a small drop in quality. A higher number (Q8) keeps more quality but needs more memory. For most people, Q4 is a great balance.

Here is a rough guide for how a 7B model shrinks with quantization:

Quantization	Quality	Approx. file size (7B model)	Speed
Full precision (no quant)	Best	~14 GB	Slowest
Q8	Very high	~7-8 GB	Slower
Q5	High	~5 GB	Medium
Q4	Good (recommended)	~4 GB	Fast

So the size of the file you download depends on both the parameter count and the quantization. A small, well-quantized model can fit on a modest laptop.

References: Hugging Face quantization guides; LM Studio documentation on model formats.

Homework

Using a publicly available tool such as the Ollama model library or a model comparison site, look up two models of different sizes — for example a 1B and a 7B variant of the same family. Write 350 words comparing what you find: how large are the files, what quantization levels are offered, and what does the documentation say about the quality difference. Reflect on what 'good enough' accuracy means for the task you care about most.

Your notes — saved on this device

5. Hardware: Will It Fit on My Computer?

The single most important question is: will the model fit in memory? A model has to load into your fast memory to run. That memory is either your computer's main RAM or, if you have a graphics card, its VRAM. A graphics card (GPU) makes answers come back much faster, but many small models run fine on the regular processor (CPU) using ordinary RAM, just more slowly.

A simple rule of thumb: the model needs roughly its file size in free memory, plus a little extra for working room. So a 4 GB model file wants around 5-6 GB of free memory.

Model size (Q4)	File size	Recommended memory	Good for
1B-3B	~1-2 GB	8 GB RAM	Older or budget laptops, quick tasks
7B-8B	~4-5 GB	16 GB RAM	Most modern laptops; the sweet spot
13B	~8 GB	16-32 GB RAM	Strong laptops and desktops
70B	~40 GB	48-64 GB RAM or a big GPU	Powerful workstations only

Apple computers with the M-series chips are especially handy here, because their memory is shared between the processor and graphics, so a model can use most of it. On Windows or Linux, a graphics card with 8 GB or more of VRAM gives a big speed boost.

If your first model feels slow, pick a smaller one or a lower quantization. It is normal to experiment until you find a model that is both useful and comfortable on your hardware.

References: Ollama hardware guidance; LM Studio system requirements documentation.

Homework

Take an inventory of your own computer or a computer you have access to. Note the RAM, the GPU (if any) and its VRAM, and whether it has an Apple Silicon chip or a standard CPU. Using what you learned in this lesson, determine the largest model you could realistically run. Write 300 words describing your hardware situation and your conclusion, and identify one model from the Ollama library that fits your setup.

Your notes — saved on this device

6. Trade-offs and Your First Steps

Local AI is wonderful, but it is fair to know the trade-offs before you start.

Where local AI wins: privacy (your data stays put), no monthly cost, works offline, and full control over which model you keep.

Where the cloud still wins: the biggest cloud models are usually smarter and more up to date, they need no setup, and they run on someone else's powerful hardware so your laptop fan stays quiet. Local models are also frozen in time at their training date and may know less about very recent events.

Many people end up using both: a private local model for personal or sensitive tasks, and a cloud model when they need maximum power.

Your first steps:

Check your computer's memory. 16 GB of RAM is a comfortable target.
Install one simple desktop runner app.
From its built-in list, download a small, popular model, ideally a 7B or 8B model in Q4.
Open the chat window and type a question, just like any chatbot.
If it feels slow, try a smaller model or a lower quantization.
Once comfortable, explore command-line tools or larger models.

That is the whole journey. Within an afternoon you can have a private, free, offline AI assistant living on your own machine, ready whenever you are.

References: Ollama getting-started documentation; Mozilla AI guide to running LLMs locally.

Homework

Identify one task you currently do with a cloud AI tool — drafting text, summarizing documents, answering questions, writing code, or anything else. Map that task against the trade-offs discussed in this lesson: speed, quality, cost, and privacy. Write 400 words explaining which factor matters most to you for that task, whether local AI is a realistic substitute today, and what your actual first step will be to try running a model locally within the next two weeks.

Your notes — saved on this device

7. Setting Up Ollama: Your First Model in Under Ten Minutes

Introduction

By now you understand why local AI matters, what open-weights models are, and roughly how quantization keeps file sizes manageable. This lesson moves from theory to practice. You are going to install Ollama — the most widely used tool for running language models on a personal computer — and have a working conversation with a model before this lecture is over. The goal is not to make you a power user today; it is to remove the psychological barrier that separates reading about local AI from actually doing it.

Ollama works by wrapping a highly optimized inference engine called llama.cpp inside a clean command-line interface and a background server. When you type a single command, Ollama downloads a pre-quantized model file in GGUF format, loads it into memory, and exposes a local API that other programs can talk to. You do not need to understand all of that right now. What matters is that the entire process — from a blank computer to a running conversation — takes fewer than ten minutes on a modern machine.

This lesson covers installation on macOS, Windows, and Linux, how to pull and run your first model, the basic commands you will use every day, and how to verify that everything is working correctly. It also explains what is happening behind the scenes at each step, so you are not just following instructions blindly but building a mental model you can troubleshoot later.

Installing Ollama and Understanding What It Does

Ollama is distributed as a single downloadable application. On macOS you download a .dmg file from ollama.com and drag it to your Applications folder, exactly like any other Mac app. On Windows you run an installer .exe. On Linux you paste a one-line shell command that downloads and configures the service automatically. In all three cases, Ollama installs a background process that starts automatically when your computer boots and listens for requests on port 11434.

That background process is the Ollama server. Every time you run a model, your terminal or any application you connect is actually sending HTTP requests to this local server, which loads the model weights and streams back the response. This architecture matters because it means any program — a Python script, a web application, a productivity tool — can talk to your locally running model using the same API format that cloud providers like OpenAI use. You can swap a cloud model for a local one in many applications just by changing a URL and removing an API key.

After installation, open a terminal and run ollama --version to confirm the tool is available. You should see a version number printed immediately. If you see an error instead, the most common cause on macOS is that the application was not opened at least once after installation — Ollama needs one manual launch to register the background service. On Linux, you may need to start the service manually with systemctl start ollama depending on your distribution.

Once Ollama is running, the model library at ollama.com/library shows every model available for direct download. Each entry lists the model family, available sizes, quantization levels, and a one-line pull command. The library is curated — models listed there have been packaged and tested by the Ollama team — but Ollama also supports importing any GGUF file from Hugging Face using the syntax ollama run hf.co/username/repo-name, which opens the entire Hugging Face ecosystem.

Pulling and Running Your First Model

The command to download a model is ollama pull model-name. For a first experiment, ollama pull llama3.2 downloads Meta's Llama 3.2 3B model, which is approximately 2 GB in its default Q4_K_M quantization. This fits comfortably in 8 GB of RAM and runs at a usable speed even on a CPU alone. If your machine has less than 8 GB of RAM, try ollama pull qwen2.5:1.5b instead — the Qwen 2.5 1.5B model is under 1 GB and still surprisingly capable for general conversation and simple tasks.

While the model downloads you will see a progress bar. Once it completes, run ollama run llama3.2 and Ollama loads the model and drops you into an interactive chat session. Type any question and press Enter. The model streams its response token by token, just as you see in web-based AI tools. To end the session, type /bye or press Ctrl-D. The model is unloaded from memory shortly after — by default Ollama keeps a model in RAM for five minutes after the last request, then releases it.

A handful of commands cover most daily use. ollama list shows every model you have downloaded. ollama rm model-name deletes a model and reclaims disk space. ollama ps shows which models are currently loaded in memory. ollama show model-name prints details about a model including its parameter count, quantization level, context window size, and the system prompt it uses by default. Knowing these four commands puts you in control of your local model library.

Speed depends heavily on hardware. On Apple Silicon Macs — M1, M2, M3, or M4 — Ollama uses the Neural Engine and GPU through Apple's Metal framework, and a 7B model typically generates 30–60 tokens per second, fast enough to feel instantaneous in conversation. On a modern CPU without a discrete GPU, expect 5–15 tokens per second for a 7B model, which is readable but noticeably slower. On a machine with an NVIDIA GPU, Ollama automatically detects CUDA and offloads computation to the GPU, typically matching or exceeding Apple Silicon speeds.

Testing, Troubleshooting, and the Local API

After your first successful conversation, test the local API directly. Open a second terminal while the model is running and type curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Say hello in one sentence.","stream":false}'. You should see a JSON response containing the model's reply. This single test confirms two things: the model is running, and external programs can reach it. If you get a connection refused error, run ollama serve in a terminal to start the server manually.

The most common issues beginners encounter are model downloads failing partway through (usually a network interruption — simply run ollama pull again and it resumes), models running out of memory (choose a smaller model or quantization level), and GPU not being detected (NVIDIA users may need to install CUDA drivers separately). The Ollama GitHub repository has an active issues tracker where solutions to common problems are usually already documented.

Ollama also exposes a chat endpoint at /api/chat that accepts a conversation history in the same format as the OpenAI chat completions API. This compatibility is intentional and powerful. Tools like Open WebUI — a browser-based chat interface you can run locally — connect to Ollama out of the box, giving you a polished ChatGPT-style interface for your local models without writing any code. Installing Open WebUI via Docker takes about five additional minutes and transforms the experience from a terminal curiosity into a daily-use tool.

Key Terms

llama.cpp — An open-source C++ library that performs efficient inference on quantized language models; the engine underlying Ollama.
GGUF — A binary file format for storing quantized model weights, used by llama.cpp and supported by Ollama and LM Studio.
Inference — The process of running a trained model to generate output; distinct from training, which creates the model in the first place.
Context window — The maximum number of tokens a model can consider at once; determines how much conversation history or document text the model can hold in mind.
Token — The basic unit a language model processes, roughly corresponding to a word fragment; models generate output one token at a time.
Open WebUI — A self-hosted, browser-based chat interface that connects to Ollama, providing a graphical front-end for local models.

Discussion Questions

Ollama's design hides a great deal of complexity behind simple commands. What are the advantages of this approach for new users, and what might an expert user want access to that Ollama's defaults conceal?
The local API on port 11434 makes your running model accessible to any program on your machine. What security considerations arise from this, and how might you address them on a shared computer?
Ollama automatically selects Q4 quantization when you pull a model without specifying a level. Is this a reasonable default, or should users be required to make an explicit choice? What does the decision reveal about the tool's design philosophy?
How does the experience of running a model locally compare to your expectations before trying it? What surprised you, positively or negatively?

Key Takeaways

Ollama reduces the installation process to a single download and three commands, making local AI accessible without technical expertise.
The background server architecture means any application can talk to your local model using familiar HTTP requests, not just the terminal.
Hardware determines speed more than anything else: Apple Silicon and NVIDIA GPUs provide the fastest experience, but CPU-only machines can still run small models usefully.
A working local model setup is a foundation you can build on — connecting graphical interfaces, integrating with scripts, and eventually running multiple models for different tasks.

Homework

Install Ollama on your own computer and run at least one successful conversation with any model from the library. Then run ollama list and ollama show model-name for the model you downloaded. Write 350 words describing what you did, what you observed about speed and response quality, and one thing the model got right and one thing it got wrong in your test conversation. If you cannot install software on your device, research the Open WebUI project and write 350 words explaining what it adds on top of Ollama and who would benefit most from it.

Your notes — saved on this device

8. Choosing the Right Model for the Job

Introduction

One of the most disorienting moments for someone new to local AI is opening the Ollama model library and seeing dozens of model families — Llama, Mistral, Qwen, Gemma, Phi, Falcon, Command-R, DeepSeek — each with multiple size variants and quantization levels. The natural question is: which one should I use? The honest answer is that no single model is best for every task, and learning to match a model to a task is one of the most valuable skills you will develop as a local AI practitioner.

This lesson gives you a practical framework for model selection. It covers the major model families available as of 2026, what each is designed for, how parameter count relates to capability, and how to read benchmarks without being misled by them. By the end you will have a personal shortlist of two or three models appropriate for your hardware and your most common tasks, rather than a vague sense that you need the biggest model possible.

The key insight is that bigger is not always better. A well-trained 3B model will outperform a carelessly trained 7B model on many tasks. A model fine-tuned specifically for coding will beat a general-purpose model twice its size on programming tasks. Matching the model to the job, and choosing the largest version that fits comfortably in your hardware, is almost always the right strategy.

Understanding Model Families and Their Strengths

The landscape of open-weights models in 2026 is dominated by a handful of major families, each with distinct strengths. Meta's Llama 3 family — including the 3.1 and 3.2 releases — set a high bar for general-purpose capability and instruction following. Llama models are licensed for commercial use up to certain usage thresholds and are among the most widely tested open-weights models available, meaning there is extensive community knowledge about how to prompt them effectively.

Alibaba's Qwen 2.5 family stands out for multilingual ability and strong performance at small sizes. The 1.5B Qwen 2.5 model consistently outperforms older 7B models on several benchmarks, making it an excellent choice for machines with limited RAM. Microsoft's Phi-4 Mini is similarly capable at 3.8B parameters and is specifically optimized for reasoning tasks, making it useful for structured analysis, mathematics, and step-by-step problem solving despite its small footprint.

For coding tasks, DeepSeek-Coder and CodeLlama remain strong choices, but the general-purpose Llama 3.1 and Qwen 2.5 models have closed the gap significantly — a 7B general model with a good system prompt now handles most everyday coding needs. For long-document tasks requiring a large context window, Mistral's models have historically supported 32K and higher context lengths at smaller sizes than competitors. Cohere's Command-R models are designed specifically for retrieval-augmented generation, which you will explore in a later lesson.

Google's Gemma 2 family offers strong performance with a permissive license and is notable for being particularly well-behaved — it refuses fewer benign requests than some competing models, which matters if you are building applications that need to handle a wide range of user inputs without unexpected refusals. The choice of model family is therefore not purely about raw capability but also about behavior, license, community support, and the specific tasks you care about most.

Reading Benchmarks Critically

Benchmarks are the primary way model developers communicate capability, and they are also among the most misused pieces of information in the AI space. The most commonly cited benchmarks include MMLU (Massive Multitask Language Understanding, which tests knowledge across 57 academic subjects), HumanEval (which tests code generation), and GSM8K (which tests grade-school mathematics). A model that scores highly on all three is generally capable, but these benchmarks have well-documented limitations.

The most important limitation is contamination: if a model was trained on data that included the benchmark questions themselves, its scores are inflated and do not reflect genuine capability. Model developers have financial incentives to report high benchmark scores, and independent replication of benchmark results is uneven. The AI community has responded by creating held-out evaluation sets and contamination-detection tools, but the problem has not been fully solved. A healthy scepticism toward benchmark numbers — especially those that seem dramatically better than competing models of similar size — is warranted.

A more reliable approach for your purposes is task-specific evaluation: give your candidate models the same ten prompts representing your actual use case and compare the outputs side by side. This takes thirty minutes and tells you far more than a benchmark table. Tools like LM Studio support side-by-side model comparison in a single interface. This kind of personal evaluation also calibrates your expectations — you may find that a 3B model handles your specific tasks well enough that you never need to download anything larger.

Speed is also a form of quality. A model that generates responses at 5 tokens per second on your hardware might be technically more capable than one that generates 40 tokens per second, but if the slower speed breaks your workflow — if you find yourself losing focus waiting for responses — then the faster model is more useful to you in practice. Benchmarks never measure this, but your lived experience will.

Building Your Personal Model Shortlist

A practical approach is to maintain a shortlist of two or three models for different purposes rather than searching for one model that does everything. A lightweight model for quick questions and drafts — something in the 1.5B–3B range that loads in seconds and runs at conversational speed — is useful to have alongside a larger, slower model for tasks that genuinely require more reasoning depth.

For most users in 2026, a reasonable starting shortlist on an 8 GB RAM machine with no GPU looks like: Qwen 2.5:1.5b for quick tasks, Llama 3.2:3b for general conversation and drafting, and Phi-4-mini for reasoning-heavy tasks. On a machine with 16 GB RAM or a GPU with 8 GB VRAM, you can comfortably add a 7B model for higher-quality output. On Apple Silicon with 16 GB unified memory, a 7B model at Q4 quantization becomes your everyday model and you can occasionally load a 13B model for demanding tasks.

Specialised models are worth knowing about even if you do not use them daily. If you work with code, having a coding-focused model available matters. If you process documents in a language other than English, checking whether your general model handles that language well — or whether a multilingual-first model like Qwen 2.5 serves you better — is worth the thirty minutes of testing. The model landscape changes quickly, and checking the Ollama library for new releases every month or two is a habit worth forming.

Key Terms

Parameter count — The number of learned numerical values in a model, usually expressed in billions (B); larger counts generally indicate higher capability at the cost of more memory and slower inference.
Benchmark — A standardized test used to compare model capability; common examples include MMLU, HumanEval, and GSM8K.
Benchmark contamination — The problem that arises when training data includes benchmark questions, inflating a model's apparent performance.
Fine-tuning — Additional training on a smaller, task-specific dataset after the original large-scale training; used to specialize a model for coding, instruction following, or a particular domain.
Instruction tuning — A specific form of fine-tuning that teaches a model to follow natural-language instructions and converse helpfully rather than simply completing text.
Context length — The maximum number of tokens (words and word-fragments) a model can process in a single request, determining how long a document or conversation it can handle.

Discussion Questions

If benchmark scores can be unreliable due to contamination, what alternative methods would you use to evaluate whether a model is suitable for a task you care about? What would your personal evaluation test include?
The existence of multilingual-first models like Qwen 2.5 raises a question about who the AI field is designing for. How does model selection intersect with questions of linguistic and cultural inclusion?
Should model developers be required to disclose whether their models were evaluated on held-out data free from contamination? What mechanisms would enforce this, and who should be responsible?
How does the strategy of maintaining a shortlist of specialized models compare to the cloud AI approach of using one general-purpose API for everything? What does each approach assume about the user?

Key Takeaways

Model families have distinct specializations: Qwen 2.5 excels at small sizes and multilingual tasks, Phi-4 Mini at reasoning, Llama 3 at general capability, and Gemma 2 at predictable, low-refusal behavior.
Benchmark scores are useful but imperfect; personal task-specific evaluation on your actual use case is more reliable than published leaderboard numbers.
Maintaining a small shortlist of models for different purposes is more practical than searching for a single best model.
Speed is a dimension of quality: a faster model you actually use beats a slower model you avoid because the wait disrupts your thinking.

Homework

Select two models from the Ollama library that you believe might suit different needs — one small (under 3B) and one medium (7B range), or two models from different families. Pull both and give each the same five test prompts drawn from a task you actually care about. Write 400 words comparing the outputs: which model performed better, where did each surprise you, and which would you keep on your shortlist and why. If you cannot run models locally, use the Hugging Face Open LLM Leaderboard to compare two models and write 400 words interpreting what the benchmark scores do and do not tell you.

Your notes — saved on this device

9. Connecting Local Models to Your Existing Tools

Introduction

Running a language model in a terminal and typing questions manually is a useful starting point, but it is not how most people want to use AI in practice. The real productivity gain comes when local models integrate seamlessly into the tools you already use — your text editor, your note-taking application, your browser, your code environment. This lesson covers the practical pathways for connecting a locally running model to the software in your existing workflow.

The key that unlocks most integrations is the OpenAI-compatible API. Ollama, LM Studio, and Jan all expose a local HTTP endpoint that accepts requests in exactly the same format that OpenAI's API uses. This means that any application with an OpenAI integration — and there are thousands of them — can be redirected to your local model by changing one setting. You do not need to modify the application, rewrite code, or understand how the model works internally. You just change a URL and, in most cases, the application works immediately with your local model.

This lesson walks through three categories of integration: graphical chat interfaces that give you a polished user experience, editor and IDE plugins that bring local AI into your writing and coding environment, and programmatic access via Python and JavaScript for users who want to build their own simple tools. By the end you will have at least one integration running that fits your actual daily workflow.

Graphical Interfaces: From Terminal to Chat Window

The most immediate improvement over the terminal is a graphical chat interface. Open WebUI is the most popular choice: it is a browser-based application you can run locally that provides a ChatGPT-style conversation interface connected to your Ollama server. It supports conversation history, multiple chat sessions, document uploads, image understanding (with vision-capable models), and a model selector that lets you switch between any model you have downloaded. Installation via Docker requires two commands and takes about five minutes.

LM Studio includes a built-in chat interface as part of its desktop application, making it a single-package solution for users who prefer not to manage Docker containers. The chat view in LM Studio also includes parameter controls — you can adjust temperature (which affects how creative or deterministic the responses are), context length, and other settings without touching a configuration file. Jan similarly provides an integrated chat interface with a focus on privacy: it has no telemetry by default and all conversation history is stored locally on your machine.

For mobile users, several iOS and Android applications can connect to a locally running Ollama server over your home Wi-Fi network. This setup — a desktop machine running Ollama, a phone app connecting to it — effectively gives you a private AI assistant on your phone that runs entirely on hardware you own, with no subscription and no data leaving your home network. The Ollama server needs to be configured to accept connections from your local network rather than only from localhost, which involves a one-line environment variable change documented in the Ollama repository.

Choosing between these interfaces is largely a matter of preference and technical comfort. Open WebUI is the most feature-rich and actively developed. LM Studio's built-in interface is the most beginner-friendly. Jan is the most privacy-preserving. All three accomplish the same core function: turning your local Ollama server into something that feels like a normal application rather than a command-line tool.

Editor Integrations: AI in Your Writing and Coding Environment

For writers, the most useful integrations are in tools like Obsidian and VS Code. Obsidian — a popular local-first note-taking application — has a plugin called Copilot that can be configured to use a local Ollama endpoint instead of a cloud API. Once configured, you can highlight any text in your notes, open a chat panel, and ask questions about it, request rewrites, or generate continuations — all without leaving the application and without sending your notes to a cloud server. This is particularly valuable for journals, research notes, or any writing that contains sensitive personal information.

For developers, VS Code has multiple extensions that support local model backends. Continue.dev is one of the most capable: it provides inline code completion, a chat sidebar, and the ability to ask questions about selected code, all pointed at your local Ollama instance. The experience is similar to GitHub Copilot but runs entirely locally. Configuration requires adding your Ollama endpoint and model name to Continue's settings file — a process that takes about ten minutes and is well-documented.

Neovim users have even more mature local AI integration options, as the community around that editor has built several plugins designed from the start to work with local models. Emacs users will find similar support. The pattern is consistent across editors: find a plugin that supports OpenAI-compatible endpoints, point it at http://localhost:11434/v1, provide any string as the API key (local Ollama does not require authentication), and select your model. The same three steps work across virtually every tool that has attempted an AI integration.

Programmatic Access: Building Simple Tools in Python

For users comfortable with basic programming, Ollama's Python library makes it possible to build custom tools in a few lines of code. After installing the library with pip install ollama, you can send a prompt and receive a response in four lines of Python. This opens up possibilities that no existing application addresses: automatically summarizing a folder of documents, asking questions about a CSV file, drafting responses to emails based on templates, or generating structured data from unstructured text.

Ollama also supports the OpenAI Python SDK directly. You instantiate the OpenAI client with base_url="http://localhost:11434/v1" and any string as the API key. Every method — chat.completions.create, streaming, function calling, embeddings — works with your local model. This matters because the enormous ecosystem of tutorials, libraries, and examples built around the OpenAI SDK is immediately available to you, all running against your local hardware rather than a paid cloud service.

A practical starting project is a simple document question-answering script: read a text file, construct a prompt that includes its contents, ask a question, and print the answer. This script has real utility — you can use it to quickly extract information from long documents, compare two versions of a file, or check whether a piece of writing covers required topics. It also teaches you the core pattern of all local AI programming: load content, construct a prompt, send to the model, process the output. Every more sophisticated application builds on this foundation.

Key Terms

OpenAI-compatible API — An HTTP endpoint that accepts requests in the same format as OpenAI's API, allowing applications built for OpenAI to work with local models without code changes.
Docker — A containerization platform that packages applications with all their dependencies; used to run Open WebUI and other self-hosted tools reliably across different operating systems.
Temperature — A parameter controlling the randomness of model output; low values produce consistent, predictable responses while high values produce more varied and creative ones.
Plugin / extension — A software add-on that extends an existing application; editor plugins bring AI functionality into tools like VS Code and Obsidian.
Embedding — A numerical representation of text as a vector of numbers; used by retrieval systems to find semantically similar passages without keyword matching.
Localhost — The network address referring to the current machine; Ollama listens on localhost by default, meaning only programs on the same computer can reach it.

Discussion Questions

The OpenAI-compatible API standard means that local models and cloud models are increasingly interchangeable from a software perspective. What are the implications of this for the AI industry, and who benefits most?
Editor integrations like Continue.dev mean that AI can read your code and notes as you write them. Does the local nature of these integrations change your comfort level with this kind of access? Where do you draw the line?
Most existing AI integrations were designed for cloud APIs with reliable high-speed internet connections. What assumptions baked into these tools become problems when the model is running on local hardware that might be slower or less consistent?

Key Takeaways

The OpenAI-compatible API standard means that thousands of existing tools can connect to your local model by changing a single URL, with no code modifications required.
Graphical interfaces like Open WebUI, LM Studio, and Jan transform local AI from a terminal curiosity into a usable daily tool for non-technical users.
Editor integrations bring AI assistance directly into writing and coding workflows, with local models eliminating the privacy concerns that come with cloud-based coding assistants.
Even basic Python programming skills unlock powerful custom workflows that no existing application offers out of the box.

Homework

Choose one integration from this lesson and implement it: install Open WebUI, configure the Continue.dev VS Code extension, or write a ten-line Python script that sends a prompt to your local Ollama server and prints the response. Write 350 words describing what you built, what friction you encountered, and what you used it for in a real task. If you do not have a suitable device, research one integration in depth — read its documentation and watch a tutorial — and write 350 words describing what it does, how it works, and what you would use it for if you had access to the hardware.

Your notes — saved on this device

10. Privacy, Data Sovereignty, and the Ethics of Local AI

Introduction

Every time you type a message into a cloud AI service, that text travels across the internet to a data center operated by a company operating under the laws of the country where those servers are located. Your words are processed, logged, and in many cases used to improve future versions of the model. For most casual questions this raises no concern. But for a Sikh community organization managing personal information about members, a healthcare worker asking questions about a patient case, a lawyer drafting a confidential document, or anyone living under a government that monitors communications, the implications are serious.

Local AI eliminates this concern by keeping data on your hardware. No packet leaves your machine. No company logs your query. No server operator can be compelled by a government to produce your conversation history. This is not merely a technical convenience — it is a form of digital autonomy that carries genuine ethical weight, and understanding it clearly allows you to make informed decisions about when local AI is not just more practical but actually more responsible.

This lesson examines the privacy architecture of local AI in concrete terms, the regulatory landscape that is reshaping these decisions for organizations in 2026, the genuine limitations of local AI as a privacy tool, and the broader ethical questions about who controls AI infrastructure and what values are embedded in the models themselves.

The Privacy Architecture of Local Inference

When you run a model with Ollama, the inference process is entirely contained within your operating system. The model weights are files on your disk. The computation happens in your CPU and GPU. The output appears in your terminal or browser. At no point does the content of your prompt or the model's response traverse a network connection. You can verify this by monitoring your network traffic with a tool like Little Snitch on macOS or Wireshark on any platform — you will see no outbound requests to external servers while the model is generating a response.

This architecture has concrete consequences. GDPR Article 4 defines personal data as any information that can be used to identify a living person, and processing it using a cloud AI service creates obligations around data processing agreements, transfer mechanisms, and breach notification. Processing the same data with a local model sidesteps these obligations because the data never leaves the organization's infrastructure. European healthcare providers, legal firms, and financial institutions were among the first professional communities to adopt local AI specifically because their regulatory environment made cloud AI legally complicated.

India's Digital Personal Data Protection Act (DPDP Act, 2023) and similar legislation being enacted across Asia and Africa reflect a global tightening of data localization requirements. Organizations that process data about citizens of these jurisdictions are increasingly required to ensure that data is processed within the jurisdiction's boundaries. A local model running on hardware within the jurisdiction satisfies this requirement by definition. A cloud model hosted in the United States or Ireland does not, regardless of what the cloud provider's terms of service say about data handling.

For Sikh community organizations specifically — gurdwaras managing langar rosters, parchar groups communicating across borders, institutions serving diaspora communities across multiple legal jurisdictions — the local AI model is not an ideological preference but a practical tool for navigating an increasingly complicated legal environment without expensive legal counsel or enterprise compliance software.

Limitations: What Local AI Cannot Protect Against

Local AI is not a complete privacy solution, and overstating its protections would be misleading. The model itself — the weights file you download — was trained by an organization that may have embedded biases, values, or behavioral constraints into it. Running Meta's Llama 3 locally does not make it a politically neutral tool; it reflects the values, data selection decisions, and safety fine-tuning choices that Meta made during training. You have privacy from cloud operators but not from the values encoded in the model by its creators.

Furthermore, the applications you use to interact with the model may collect data independently. If you run Open WebUI and configure it to back up conversation history to a cloud service, your conversations leave your machine through that path rather than through the model inference itself. Plugins, extensions, and integrations each introduce their own data flows that need to be audited separately. True data sovereignty requires examining the entire pipeline, not just the inference step.

The hardware you run the model on is also a factor. Enterprise IT environments may have monitoring software that logs all local processes. Shared computers present obvious risks. Laptops that sync files to cloud storage may inadvertently upload model outputs if conversation logs are saved to a synced folder. These are not arguments against local AI as a privacy tool — they are reminders that privacy is a systems property, not a feature of any single component.

Key Terms

Data sovereignty — The principle that data is subject to the laws and governance of the jurisdiction in which it is collected and processed.
GDPR — General Data Protection Regulation; European Union law governing personal data processing, in force since 2018 and actively enforced in 2026.
Data localization — Legal requirements mandating that certain categories of data be stored and processed within a specific country or region.
Privacy by design — An engineering principle that builds privacy protections into a system's architecture from the start rather than adding them later.
Telemetry — Automated data collection by software about how it is used; many applications collect telemetry by default unless the user opts out.
ਡੇਟਾ ਸੁਰੱਖਿਆ — Punjabi term for data protection; understanding this concept in your own language helps when explaining local AI to community members unfamiliar with English technical vocabulary.

Discussion Questions

Sikhi teaches the concept of ਸਰਬੱਤ ਦਾ ਭਲਾ — the well-being of all. How does the question of who controls AI infrastructure connect to this principle? Does data sovereignty serve the many or primarily the privileged few with the hardware to run local models?
A model trained by a large technology corporation encodes that corporation's values even when run locally. Can a truly community-controlled AI exist, and what would it take to create one?
Healthcare workers in many countries are legally prohibited from using cloud AI tools for patient-related queries. How should institutions communicate local AI as a compliant alternative without creating the false impression that it is risk-free?
The DPDP Act in India and GDPR in Europe were both designed for a world of databases, not language models. What aspects of AI-specific privacy do current laws fail to address?

Key Takeaways

Local inference keeps data entirely on your hardware: no network traffic, no cloud logging, no third-party data processing agreements required.
Regulatory frameworks including GDPR, India's DPDP Act, and data localization laws in multiple countries make local AI not just a preference but a compliance tool for many organizations.
Privacy is a systems property: a local model does not automatically make an entire workflow private if other components — applications, plugins, cloud backups — transmit data externally.
The values embedded in a model during training by its creator remain present regardless of where inference runs; local AI provides operational privacy but not ideological neutrality.

Homework

Identify a real situation — in your own life, your community, or a professional context you know well — where data privacy matters. Research whether the relevant jurisdiction has laws that would affect using a cloud AI tool for that situation (look up GDPR if you are in Europe, DPDP Act if in India, or your state or country's applicable law). Write 400 words describing the situation, the legal landscape, and whether local AI would be a genuine solution or whether other safeguards would also be needed.

Your notes — saved on this device

11. Retrieval-Augmented Generation: Teaching Your Model About Your Documents

Introduction

Every open-weights language model has a knowledge cutoff — a date after which it has not seen any new information. It also has no knowledge of your personal documents, your organization's internal records, or the specific corpus of material you want to reason about. Asking a model about your own files produces either confident hallucinations or an honest admission that it does not know. This is a fundamental limitation of base models, and it is one that retrieval-augmented generation was designed to address.

Retrieval-augmented generation, commonly abbreviated as RAG, is a technique that connects a language model to an external document store. When you ask a question, the system first searches the document store for passages relevant to your query, then provides those passages to the model as context, and then asks the model to answer using that context. The model does not need to memorize your documents — it reads them on demand, the same way a research assistant might pull relevant pages from a filing cabinet before answering your question.

This lesson explains how RAG works, what the components are, how to build a minimal working system locally using Ollama and a vector database, and what RAG is genuinely useful for versus where it struggles. Understanding RAG is one of the most practically valuable things you can learn about local AI because it transforms a generic chat assistant into something that can reason about your specific knowledge base.

How RAG Works: The Pipeline Explained

A RAG system has four stages: ingestion, embedding, retrieval, and generation. Ingestion is the process of reading your documents and splitting them into smaller passages, typically 200–500 words each with some overlap between adjacent passages. This chunking matters because language models have limited context windows — you cannot simply paste an entire book into a prompt. Splitting the document into passages allows the system to select only the relevant portions for any given question.

Embedding is the process of converting each text passage into a vector — a list of numbers, typically several hundred to over a thousand values — that encodes its meaning. Two passages that are semantically similar will have vectors that are mathematically close to each other, even if they use different words. A passage about a gurdwara's annual budget will be vectorially close to a passage about the committee's financial decisions, because both are about the same topic. These vectors are stored in a vector database, which is a specialized data store optimized for finding the closest vectors to a given query vector.

Retrieval happens at query time. When you ask a question, the same embedding model converts your question into a vector and searches the database for the passages whose vectors are closest to it. The top three to five most relevant passages are selected and assembled into a context block. Generation is the final step: the context block and your question are combined into a prompt sent to the language model, which produces an answer grounded in the retrieved passages. The model is typically instructed to answer only from the provided context and to say that it does not know if the answer is not present.

Running this pipeline entirely locally requires two models: an embedding model and a generation model. Ollama provides both. The nomic-embed-text model is the standard embedding model for local RAG pipelines — it is small (around 270 MB), fast, and produces high-quality embeddings for English and many other languages. For the generation model, any of the 7B general-purpose models works well. The vector database can be Chroma (a lightweight Python library) or Qdrant (which runs as a local Docker container). Both are free and open-source.

Building a Local RAG System and Understanding Its Limits

A functional local RAG pipeline can be built in under 100 lines of Python using the Ollama Python library, LangChain or LlamaIndex for orchestration, and Chroma for vector storage. The basic script reads a folder of PDF or text files, splits them into chunks, embeds each chunk with nomic-embed-text, stores the vectors in Chroma, and then accepts questions from the user. For each question it retrieves the top five chunks, constructs a prompt, and calls the generation model. For documents you frequently reference — policy manuals, research papers, meeting minutes, scripture commentaries — this kind of system provides genuine utility.

Understanding what RAG does well and what it struggles with prevents frustration. RAG excels at factual lookup: finding specific information that exists verbatim or near-verbatim in the document. It handles questions like "What does this policy say about X?" or "Which section covers Y?" reliably when the document is well-structured. It struggles with synthesis questions that require connecting information spread across many documents, with questions that require external knowledge not present in the document store, and with documents that are heavily formatted (complex tables, multi-column layouts, or PDFs with images) where the text extraction step may produce garbled input.

Chunking strategy significantly affects retrieval quality. Chunks that are too small miss context; chunks that are too large reduce retrieval precision. Overlapping chunks — where each chunk shares 50–100 words with the adjacent chunk — help ensure that answers at chunk boundaries are not missed. Document quality matters too: a clean plain-text document yields far better retrieval than a scanned PDF processed with imperfect optical character recognition. Pre-processing your documents — cleaning whitespace, removing headers and footers, ensuring consistent encoding — is often more impactful than tuning model parameters.

Key Terms

RAG (Retrieval-Augmented Generation) — A technique combining document retrieval with language model generation to answer questions grounded in a specific document corpus.
Vector embedding — A numerical representation of text as a high-dimensional vector, where semantic similarity between texts is reflected by mathematical closeness between their vectors.
Vector database — A data store optimized for similarity search over high-dimensional vectors; examples include Chroma, Qdrant, and Weaviate.
Chunking — The process of splitting documents into smaller passages for embedding and retrieval; chunk size and overlap are key parameters affecting RAG quality.
Knowledge cutoff — The date after which a model's training data does not include new information; RAG addresses this limitation for domain-specific or recent documents.
ਗਿਆਨ ਭੰਡਾਰ — Punjabi for knowledge treasury; a fitting metaphor for a RAG document store that makes an organization's accumulated knowledge queryable by AI.

Discussion Questions

A RAG system built on historical Sikh documents — gurmat literature, historical accounts, katha recordings transcribed to text — could allow anyone to ask questions and receive sourced answers. What safeguards would you want in place before deploying such a system publicly?
RAG grounds model responses in specific documents, which reduces hallucination but does not eliminate it. A model can still misinterpret or selectively quote from retrieved passages. How should a RAG system communicate its uncertainty to users?
Building a RAG system requires organizing, cleaning, and curating a document corpus. This curation process itself involves editorial decisions about what to include and exclude. Who should make those decisions for a community knowledge base, and through what process?

Key Takeaways

RAG connects a language model to a document corpus at query time, allowing it to answer questions about documents it was never trained on without requiring retraining.
The pipeline has four stages — ingestion, embedding, retrieval, and generation — each of which can be run entirely locally with Ollama and open-source tools.
RAG is strongest for factual lookup in well-structured text and weakest for synthesis across many documents or for questions requiring external knowledge.
Document quality and chunking strategy are often more important to RAG performance than the choice of language model or embedding model.

Homework

Choose a personal or community document you have access to — a policy document, a PDF book or article, meeting notes, or any substantial text. Design (but do not necessarily implement) a RAG system for it: write 400 words describing what documents you would include, how you would chunk them, what types of questions you would most want to answer, and what risks or limitations you would need to communicate to users. If you have the technical ability, implement a minimal version using the Ollama Python library and Chroma and describe what you built and how it performed.

Your notes — saved on this device

12. Customizing Models: System Prompts, Modelfiles, and Fine-Tuning Basics

Introduction

A base open-weights model is a general-purpose tool. It can write poetry, debug code, explain scientific concepts, translate languages, and summarize documents — but it does all of these things with the same default persona and the same implicit assumptions about what you want. For many use cases, this generality is exactly right. But for applications with a specific purpose — a customer support assistant that only answers questions about one product, a study tool that always responds in Punjabi, a coding assistant that follows your team's style guide — you need a way to constrain and direct the model's behavior.

This lesson covers three levels of model customization, arranged from simplest to most technically demanding. The first is the system prompt: a set of instructions provided to the model before every conversation that shapes its persona, tone, and scope. The second is Ollama's Modelfile format, which lets you create a named, reusable model configuration that bundles a base model with a system prompt, parameter settings, and other options. The third is fine-tuning: the process of continuing a model's training on your own data to change its behavior more fundamentally than any prompt can achieve.

Understanding which level of customization you need for a given application is as important as understanding how to implement each one. System prompts solve most problems that arise in practice. Fine-tuning is powerful but expensive, and it is often the wrong tool when a well-crafted prompt would suffice.

System Prompts and Modelfiles

A system prompt is text that is prepended to every conversation before the user's first message. The model treats it as standing instructions from the operator rather than from the user. A system prompt might say: "You are a teaching assistant for a Sikh studies course. Always answer in clear, accessible English. If a user asks about a Gurbani concept, explain it using the language of Gurmat philosophy without introducing external religious frameworks. If you do not know an answer, say so directly rather than speculating." This single block of text meaningfully changes the model's behavior for every subsequent exchange without any modification to the model weights.

In Ollama, system prompts can be provided interactively in a chat session using /set system "your prompt here", but the more useful approach is to create a Modelfile. A Modelfile is a plain text file that tells Ollama how to build a new named model variant. At minimum it specifies a base model with FROM llama3.2 and a system prompt with SYSTEM "...". You can also set the temperature with PARAMETER temperature 0.3 for more consistent, less creative responses, or increase the context window with PARAMETER num_ctx 8192. Running ollama create my-assistant -f Modelfile builds the variant, which then appears in ollama list like any other model and can be run with ollama run my-assistant.

Modelfiles are particularly useful for creating purpose-specific tools that non-technical colleagues can use. You create and test the Modelfile yourself, share the resulting model configuration, and your colleagues simply run ollama run your-assistant-name and interact with a well-configured, appropriately constrained model without needing to understand how any of it works. This separation of roles — configuration by a knowledgeable person, use by anyone — mirrors how professional software tools are typically deployed.

Effective system prompt writing is itself a skill. The most common mistake is making prompts too long and contradictory — if a prompt includes thirty rules, the model may follow some and ignore others, and conflicts between rules produce unpredictable behavior. A focused prompt with five to eight clear instructions reliably outperforms a sprawling one. Testing your system prompt against adversarial inputs — questions designed to make the model ignore its instructions — is essential before deploying any model variant for others to use.

Fine-Tuning: When Prompts Are Not Enough

Fine-tuning is the process of continuing a model's training on a new, smaller dataset to adjust its behavior at the level of the weights rather than at the level of runtime instructions. Where a system prompt says "always respond in this style", fine-tuning changes what the model has learned so that it naturally tends toward that style without being told. The distinction matters for applications where the model's behavior needs to be deeply consistent across very long conversations, where the desired style is too subtle or complex to capture in a prompt, or where the model needs to internalize specialized vocabulary or domain knowledge not present in its original training data.

The standard technique for fine-tuning on consumer hardware in 2026 is QLoRA — Quantized Low-Rank Adaptation. LoRA works by adding small trainable matrices to specific layers of the frozen base model, learning only the adjustments needed for the target task. This reduces the number of trainable parameters from billions to tens of millions. QLoRA further reduces memory requirements by keeping the base model in 4-bit quantization during training. The practical result is that a 7B model can be fine-tuned on a consumer GPU with 12 GB of VRAM — an RTX 4070 Ti, for example — in a few hours on a dataset of several thousand examples.

The toolchain for fine-tuning in 2026 centers on three libraries: Unsloth (optimized for speed on consumer GPUs), Axolotl (a higher-level framework for configuration-driven training pipelines), and HuggingFace's TRL library (which provides the RLHF and preference optimization techniques used to create instruction-following models). All three are open-source and free. A fine-tuning dataset needs to be structured as pairs of instruction and response, typically in a JSON or CSV format. The quality of the dataset matters far more than its size: a hundred high-quality examples that precisely represent the desired behavior often outperforms ten thousand mediocre ones.

Key Terms

System prompt — Instructions provided to a language model before every user message, shaping its persona, scope, and behavior without modifying its weights.
Modelfile — An Ollama configuration file that bundles a base model with a system prompt, parameter settings, and other options into a reusable named model variant.
LoRA (Low-Rank Adaptation) — A fine-tuning technique that adds small trainable matrices to frozen model layers, reducing trainable parameters to a fraction of the total.
QLoRA — LoRA combined with 4-bit quantization of the base model during training, making fine-tuning feasible on consumer hardware with limited VRAM.
Instruction dataset — A collection of instruction-response pairs used to fine-tune a model for a specific task or style of interaction.
Overfitting — A training failure mode where the model memorizes the fine-tuning examples rather than generalizing from them, producing poor performance on inputs not seen during training.

Discussion Questions

A system prompt can instruct a model to behave as a specific persona — a historical figure, a religious teacher, a fictional character. What ethical guidelines should govern this use of system prompts, particularly when the persona is a revered religious or historical person?
Fine-tuning a model on a community's text corpus — historical documents, recorded speeches, published writings — raises questions about consent and representation. Who speaks for the community in deciding what goes into a training dataset?
The QLoRA technique makes fine-tuning accessible to individuals with consumer hardware, but creating a high-quality training dataset still requires significant human effort. How does this constraint affect who is able to create specialized AI tools, and what are the implications for equity in AI development?
When a fine-tuned model produces outputs that reflect the biases in its fine-tuning dataset, who bears responsibility — the person who created the dataset, the person who ran the fine-tuning, or the person who deployed the model?

Key Takeaways

System prompts are the fastest and most accessible way to customize model behavior; a focused, well-tested system prompt solves most practical customization needs without any training.
Ollama's Modelfile format packages a system prompt and parameter settings into a reusable, shareable model configuration that non-technical users can run with a single command.
QLoRA makes genuine fine-tuning feasible on consumer GPUs with 12 GB or more of VRAM, but it requires a high-quality instruction dataset and careful evaluation to avoid overfitting.
The right level of customization depends on the task: system prompts for behavioral guidance, Modelfiles for reusable deployment, and fine-tuning only when prompt-level control is genuinely insufficient.

Homework

Write a system prompt for one specific use case you care about — a study assistant, a writing helper, a document summarizer, a language tutor, or any other focused application. Your system prompt should be 150–250 words and include: a clear description of the model's role, specific instructions about tone and format, at least one explicit boundary (what the model should refuse or redirect), and a statement about how the model should handle uncertainty. Test your system prompt with at least five varied inputs and write 300 words reflecting on what worked, what did not, and how you revised it.

Your notes — saved on this device

References & further reading

Ollama official documentation and model library (ollama.com)
LM Studio documentation (lmstudio.ai)
Hugging Face model hub and learning guides (huggingface.co)
Mozilla's article on running open-source LLMs locally (Mozilla AI / Mozilla blog)
MIT Technology Review reporting on open-weight AI models

What you'll learn

Key terms — ਸ਼ਬਦਾਵਲੀ

Lessons

1. Why Run AI on Your Own Machine?

Homework

2. What 'Open Weights' Really Means

Homework

3. The Tools That Run Models for You

Homework

4. Model Sizes and Quantization Made Simple

Homework

5. Hardware: Will It Fit on My Computer?

Homework

6. Trade-offs and Your First Steps

Homework

7. Setting Up Ollama: Your First Model in Under Ten Minutes

Introduction

Installing Ollama and Understanding What It Does

Pulling and Running Your First Model

Testing, Troubleshooting, and the Local API

Key Terms

Discussion Questions

Further Reading

Key Takeaways

Homework

8. Choosing the Right Model for the Job

Introduction

Understanding Model Families and Their Strengths

Reading Benchmarks Critically

Building Your Personal Model Shortlist

Key Terms

Discussion Questions

Further Reading

Key Takeaways

Homework

9. Connecting Local Models to Your Existing Tools

Introduction

Graphical Interfaces: From Terminal to Chat Window

Editor Integrations: AI in Your Writing and Coding Environment

Programmatic Access: Building Simple Tools in Python

Key Terms

Discussion Questions

Further Reading

Key Takeaways

Homework

10. Privacy, Data Sovereignty, and the Ethics of Local AI

Introduction

The Privacy Architecture of Local Inference

Limitations: What Local AI Cannot Protect Against

Key Terms

Discussion Questions

Further Reading

Key Takeaways

Homework

11. Retrieval-Augmented Generation: Teaching Your Model About Your Documents

Introduction

How RAG Works: The Pipeline Explained

Building a Local RAG System and Understanding Its Limits

Key Terms

Discussion Questions

Further Reading

Key Takeaways

Homework

12. Customizing Models: System Prompts, Modelfiles, and Fine-Tuning Basics

Introduction

System Prompts and Modelfiles

Fine-Tuning: When Prompts Are Not Enough

Key Terms

Discussion Questions

Further Reading

Key Takeaways

Homework

Gurbani Verses — ਗੁਰਬਾਣੀ

Parallel text — ਪਾਠ ਤੇ ਅਰਥ

References & further reading

Flashcards — ਕਾਰਡ ਅਭਿਆਸ

Course test

Read the source texts

Rate this course

Discussion & Q&A