Welcome to the module dedicated to Ollama.
So far, many artificial intelligence projects have been built by consuming cloud model APIs, such as those from OpenAI, Anthropic, Google, or other providers. That approach is very powerful, but it is not the only one.
In this tutorial we will work with another scenario: running language models directly on our own computer.
This matters for several reasons:
- Local privacy: if models are downloaded and executed locally, prompts and responses do not need to leave the machine.
- Inference cost: there is no per-token cost when running local models, although there are indirect costs in hardware, energy, and storage.
- Technical learning: understanding how models are downloaded, loaded, executed, and exposed helps you better understand real AI systems.
- Fast prototyping: it lets you experiment without always depending on an external API.
- Integration: Ollama exposes a local HTTP API that can be consumed from Python, Node.js, .NET, web applications, RAG tools, and agents.
The tutorial is designed for mixed teams: Windows, macOS, and Linux. We will first cover the common concepts and then the specific differences for each operating system.
Table of contents
- What is Ollama
- When to use local models
- System requirements
- Installation by operating system
- Verifying the installation
- How to restart Ollama
- First local model
- Essential CLI commands
- Where models are stored
- Environment variables by operating system
- Ollama REST API
- Python integration
- Embeddings for RAG
- How to know if it is using CPU or GPU
- Context length
- OpenAI API compatibility
- Best practices
- Common errors by operating system
- Proposed exercises
- Closing
1. What is Ollama
Ollama is a tool that lets you download, manage, and run language models locally.
A simple way to understand it is to think of Ollama as a runtime environment for AI models. The user requests a model by name; Ollama downloads it if it is not already available locally, loads it into memory, and lets you use it from:
- the command line;
- a local HTTP API;
- libraries such as Python or JavaScript;
- integrations compatible with OpenAI-style APIs.
By default, the local Ollama server is available at:
http://localhost:11434
And its API base is available at:
http://localhost:11434/api
Main components
| Component | Description |
|---|---|
| Ollama server | Process that runs models and serves local requests. |
| ollama CLI | Command to download, run, list, and manage models. |
| REST API | Local HTTP API to integrate models into applications. |
| Models | Files downloaded locally that contain the model weights. |
2. When to use local models
Ollama is especially useful when you want to:
- learn how models work locally;
- test prompts at no per-token cost;
- build private prototypes;
- run RAG tests on a development machine;
- run models without a permanent internet connection;
- integrate AI into internal tools;
- compare smaller models before using more powerful cloud models.
Quick comparison
| Aspect | Local Ollama | Cloud APIs |
|---|---|---|
| Privacy | High if the model runs locally | Depends on the provider |
| Per-token cost | No direct per-token cost | Pay per use |
| Maximum quality | Limited by hardware and local model | Access to frontier models |
| Speed | Depends on local CPU/GPU | Depends on network and provider |
| Offline | Yes, after downloading the model | No |
| Maintenance | User’s responsibility | Provider’s responsibility |
Practical rule: use Ollama for learning, prototyping, and building local solutions; use cloud APIs when you need maximum quality, managed scalability, or frontier models.
3. System requirements
The requirements depend on the model you want to run.
General requirements
- 8 GB of RAM as a minimum for small models.
- 16 GB of RAM or more to work comfortably.
- SSD recommended.
- Enough free space: models can take from a few GB up to tens or hundreds of GB.
- GPU recommended for better performance, although many small models can run on CPU.
Windows
Typical requirements:
- Windows 10 22H2 or newer, or Windows 11.
- Up-to-date NVIDIA drivers if using an NVIDIA GPU.
- Up-to-date AMD Radeon driver if using an AMD Radeon GPU.
macOS
Typical requirements:
- macOS Sonoma 14 or newer.
- Apple Silicon M-series with CPU/GPU support.
- On Intel/x86 Macs, expect CPU execution.
Linux
Typical requirements:
- A modern Linux distribution.
- systemd if you plan to run Ollama as a service, as in the standard installation.
- NVIDIA or AMD/ROCm drivers properly installed if you want to use the GPU.
- Permissions to install system services or binaries.
Practical model size rule
An approximate rule for models quantized to 4 bits:
Model with N B parameters, where B means billion, ≈ N × 0.6 GB of RAM/VRAM
In this context, 7B means approximately 7 billion parameters.
Approximate examples:
| Model | Approximate memory | Comment |
|---|---|---|
| 3B | 2 to 3 GB | Good for modest machines |
| 7B / 8B | 4 to 6 GB | Good entry point |
| 13B / 14B | 8 to 10 GB | Requires more memory |
| 32B | 18 to 24 GB | Recommended on a large GPU or with plenty of RAM |
| 70B | 40 GB or more | Advanced use |
These figures are indicative. Actual consumption depends on quantization, context, GPU, offloading, and model configuration.
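As a quick sanity check before downloading a model, the rule above can be expressed as a tiny Python helper. This is only a sketch of the approximation; the 0.6 factor assumes 4-bit quantization as described above.
def estimate_memory_gb(params_billions: float) -> float:
    # Rough rule for 4-bit quantized models: N billion parameters ≈ N × 0.6 GB
    return params_billions * 0.6

for n in (3, 7, 13, 32, 70):
    print(f"{n}B model: ~{estimate_memory_gb(n):.1f} GB of RAM/VRAM")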
4. Installation by operating system
4.1 Windows
Option A: graphical installer
- Go to the official Ollama download page.
- Download OllamaSetup.exe.
- Run the installer.
- Open a new PowerShell or CMD terminal.
- Verify:
ollama --version
On Windows, Ollama keeps running in the background and the ollama command becomes available in PowerShell, CMD, or your preferred terminal.
Option B: winget
If you use winget, you can install with:
winget install Ollama.Ollama
Verify that the server responds
curl.exe http://localhost:11434
The expected response is similar to:
Ollama is running
On classic Windows PowerShell, curl may behave differently because it can resolve to an alias. To avoid issues, use curl.exe or Invoke-WebRequest.
4.2 macOS
Recommended option: official application
- Download the .dmg file from the official Ollama page.
- Mount the .dmg.
- Drag Ollama.app to Applications.
- Open Ollama.
- If the application asks for permission to install the ollama command on the PATH, accept.
Verify from Terminal:
ollama --version
Verify that the server responds:
curl http://localhost:11434
Alternative with Homebrew
In some development environments Homebrew may be preferred:
brew install --cask ollama-app
Then open the Ollama application and verify:
ollama --version
4.3 Linux
Installation with the official script
curl -fsSL https://ollama.com/install.sh | sh
Then start and verify the service:
sudo systemctl start ollama
sudo systemctl status ollama
Verify the API:
curl http://localhost:11434
Expected response:
Ollama is running
Security note
In corporate or production environments, it is good practice to review any remote script before executing it.
An alternative is to download it, inspect it, and then run it:
curl -fsSL https://ollama.com/install.sh -o install-ollama.sh
less install-ollama.sh
sh install-ollama.sh
5. Verifying the installation
The basic commands are the same on the three systems.
Windows
ollama --version
curl.exe http://localhost:11434
macOS
ollama --version
curl http://localhost:11434
Linux
ollama --version
curl http://localhost:11434
sudo systemctl status ollama
If everything is correct, ollama --version will display the installed version and curl http://localhost:11434 will respond that Ollama is running.
6. How to restart Ollama
In several parts of this tutorial we will say “restart Ollama”. It is worth clarifying what that means on each operating system.
Windows
- Find the Ollama icon in the system tray.
- Right-click it.
- Choose Quit.
- Reopen Ollama from the Start menu.
Then verify:
curl.exe http://localhost:11434
macOS
- Click the Ollama icon in the menu bar.
- Choose Quit Ollama.
- Reopen Ollama.app from Applications.
Then verify:
curl http://localhost:11434
Linux
If Ollama was installed as a systemd service:
sudo systemctl restart ollama
sudo systemctl status ollama
To review logs after the restart:
journalctl -e -u ollama
7. First local model
We will run a lightweight model to get started.
ollama run llama3.2
The first time, Ollama will download the model. Then it will open an interactive session:
>>> Send a message (/? for help)
Try:
Explain in one sentence what supervised learning is.
To exit:
/bye
Or also:
Ctrl+D
What happened behind the scenes
When we run:
ollama run llama3.2
Ollama does the following:
- It checks whether the model is already downloaded.
- If it is not downloaded, it pulls it from the model registry.
- It loads the model into RAM or VRAM.
- It opens an interactive session.
- It keeps the model in memory for some time to avoid reloading it on every query.
By default, Ollama keeps the model loaded for 5 minutes after the last use. This explains why RAM or VRAM does not always drop immediately when a query finishes.
To free memory manually:
ollama stop llama3.2
It can also be controlled with OLLAMA_KEEP_ALIVE or with the keep_alive parameter in the API.
8. Essential CLI commands
These commands work the same on Windows, macOS, and Linux.
# List downloaded models
ollama list
# Download a model without running it
ollama pull llama3.2
# Run a model
ollama run llama3.2
# View models currently loaded in memory
ollama ps
# Stop a loaded model
ollama stop llama3.2
# Remove a model from disk
ollama rm llama3.2
# View model information
ollama show llama3.2
# Start the server manually
ollama serve
Recommended models to get started
The sizes are indicative. They may vary depending on the exact tag, the quantization, and the published version of the model. Before downloading a large model, it is worth checking its entry in the Ollama library.
| Use | Suggested model | Approximate size | Comment |
|---|---|---|---|
| Lightweight chat | llama3.2 | ~2 GB | Good for modest machines |
| General chat | gemma3 | ~3 GB to ~17 GB | Depends on the chosen tag; start with smaller variants |
| General chat / reasoning | qwen3 | ~3 GB to ~5 GB | Good performance in various scenarios |
| Code | qwen2.5-coder | ~5 GB | Specialized in programming |
| Embeddings | embeddinggemma | <1 GB | Recommended for embeddings |
| Lightweight embeddings | all-minilm | <1 GB | Fast and lightweight |
If your machine has little VRAM, for example 4 GB, it is best to start with lightweight models and check with ollama ps whether they fit entirely on the GPU.
To download a specific variant use:
ollama pull model-name:tag
Example:
ollama pull llama3.1:8b
9. Where models are stored
The exact location can vary depending on the operating system and the installation method.
| System | Typical location |
|---|---|
| Windows | %USERPROFILE%\.ollama or %HOMEPATH%\.ollama |
| macOS | ~/.ollama |
| Linux | Usually under the Ollama user/service; may vary by installation |
Windows
To open the models and configuration folder:
explorer %USERPROFILE%\.ollama
To open logs:
explorer %LOCALAPPDATA%\Ollama
macOS
Models and configuration:
ls ~/.ollama
Logs:
ls ~/.ollama/logs
cat ~/.ollama/logs/server.log
Linux
View service logs:
journalctl -e -u ollama
Or in follow mode:
journalctl -u ollama --no-pager --follow --pager-end
10. Environment variables by operating system
Ollama is configured through environment variables.
Common variables:
| Variable | Use |
|---|---|
| OLLAMA_HOST | Address and port where Ollama listens. |
| OLLAMA_MODELS | Folder where models are stored. |
| OLLAMA_KEEP_ALIVE | Time a model stays loaded. Default: 5 minutes. |
| OLLAMA_NUM_PARALLEL | Control of parallel requests. |
| OLLAMA_ORIGINS | Allowed origins for CORS. |
| OLLAMA_CONTEXT_LENGTH | Default effective context for models, when applicable. |
| OLLAMA_FLASH_ATTENTION | Can enable Flash Attention if the hardware supports it. |
10.1 Windows
On Windows, Ollama inherits environment variables from the user or the system.
Steps:
- Quit Ollama from the system tray.
- Open Settings or Control Panel.
- Search for “environment variables”.
- Edit the user environment variables.
- Create or edit variables such as OLLAMA_MODELS, OLLAMA_HOST, OLLAMA_KEEP_ALIVE.
- Save.
- Reopen Ollama from the Start menu.
Examples:
OLLAMA_MODELS=D:\OllamaModels
OLLAMA_HOST=127.0.0.1:11434
OLLAMA_KEEP_ALIVE=5m
10.2 macOS
If Ollama runs as a macOS application, configure variables with launchctl.
Examples:
launchctl setenv OLLAMA_HOST "127.0.0.1:11434"
launchctl setenv OLLAMA_MODELS "$HOME/OllamaModels"
launchctl setenv OLLAMA_KEEP_ALIVE "5m"
Then quit and reopen Ollama.app.
To check a variable:
launchctl getenv OLLAMA_HOST
10.3 Linux
If Ollama runs as a systemd service, edit the service:
sudo systemctl edit ollama.service
Add:
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MODELS=/mnt/models/ollama"
Environment="OLLAMA_KEEP_ALIVE=5m"
Apply changes:
sudo systemctl daemon-reload
sudo systemctl restart ollama
sudo systemctl status ollama
Verify variables applied to the service:
systemctl show ollama | grep Environment
View logs:
journalctl -e -u ollama
10.4 Access from other machines on the local network
By default, Ollama listens on 127.0.0.1:11434, that is, it only accepts connections from the same machine.
To accept connections from other machines on the local network, you can use:
OLLAMA_HOST=0.0.0.0:11434
Then you must restart Ollama according to the operating system.
Once configured, from another machine on the network you could call:
curl http://MACHINE_IP:11434
Security warning: do not expose Ollama directly to the internet. Local Ollama does not include native authentication intended for public exposure. If you need remote access, use a firewall, VPN, reverse proxy with authentication, and restrictive network rules.
11. Ollama REST API
The local Ollama API is available at:
http://localhost:11434/api
11.1 /api/generate endpoint
This endpoint generates text from a prompt.
Windows PowerShell
(Invoke-WebRequest `
-Method POST `
-Uri http://localhost:11434/api/generate `
-ContentType "application/json" `
-Body '{
"model": "llama3.2",
"prompt": "Explain what logistic regression is.",
"stream": false
}'
).Content | ConvertFrom-Json
macOS / Linux
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Explain what logistic regression is.",
"stream": false
}'
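The same request can be sent from Python without the official client, for example with the requests package (pip install requests). With "stream": false the server returns a single JSON object whose generated text is in the response field.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain what logistic regression is.",
        "stream": False
    },
    timeout=120
)
resp.raise_for_status()
print(resp.json()["response"])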
11.2 /api/chat endpoint
This endpoint lets you work with messages and roles.
Windows PowerShell
(Invoke-WebRequest `
-Method POST `
-Uri http://localhost:11434/api/chat `
-ContentType "application/json" `
-Body '{
"model": "llama3.2",
"messages": [
{
"role": "system",
"content": "You are an expert assistant in statistics."
},
{
"role": "user",
"content": "Explain the difference between variance and standard deviation."
}
],
"stream": false
}'
).Content | ConvertFrom-Json
macOS / Linux
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{
"role": "system",
"content": "You are an expert assistant in statistics."
},
{
"role": "user",
"content": "Explain the difference between variance and standard deviation."
}
],
"stream": false
}'
11.3 keep_alive in the API
By default, Ollama keeps the model loaded in memory for about 5 minutes after a query. This improves response time if you make several queries in a row.
To unload the model immediately after responding:
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Briefly: what is Ollama?",
"stream": false,
"keep_alive": 0
}'
To keep it loaded indefinitely while the server keeps running:
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"prompt": "Preload the model.",
"stream": false,
"keep_alive": -1
}'
It can also be released manually from the CLI:
ollama stop llama3.2
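The Python client covered in the next section accepts the same parameter, so a model can be unloaded right after a call without writing raw HTTP. A minimal sketch, assuming the installed ollama package supports the keep_alive argument:
import ollama

# keep_alive=0 asks Ollama to unload the model as soon as it has answered
response = ollama.generate(
    model="llama3.2",
    prompt="Briefly: what is Ollama?",
    keep_alive=0
)
print(response["response"])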
12. Python integration
Install the official client:
pip install ollama
Basic chat
import ollama
response = ollama.chat(
model="llama3.2",
messages=[
{
"role": "user",
"content": "Give me 3 techniques to avoid overfitting."
}
]
)
print(response["message"]["content"])
Streaming
import ollama
stream = ollama.chat(
model="llama3.2",
messages=[
{
"role": "user",
"content": "Tell me a short story about AI."
}
],
stream=True
)
for chunk in stream:
print(chunk["message"]["content"], end="", flush=True)
Chat with system prompt
import ollama
response = ollama.chat(
model="llama3.2",
messages=[
{
"role": "system",
"content": "Always respond in clear English with simple examples."
},
{
"role": "user",
"content": "What is a neural network?"
}
]
)
print(response["message"]["content"])
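Neither /api/chat nor the Python client stores the conversation between calls: the full message history has to be sent on every request. A minimal sketch of a multi-turn loop that carries the history manually (the example questions are just illustrative):
import ollama

history = [
    {"role": "system", "content": "Always respond in clear English with simple examples."}
]

for user_text in ["What is a neural network?", "And what is a layer in that context?"]:
    history.append({"role": "user", "content": user_text})
    response = ollama.chat(model="llama3.2", messages=history)
    answer = response["message"]["content"]
    # Store the assistant reply so the next turn sees the whole conversation
    history.append({"role": "assistant", "content": answer})
    print(answer)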
13. Embeddings for RAG
Embeddings convert text into numerical vectors. They are fundamental for building semantic search systems and RAG.
Download an embeddings model
ollama pull embeddinggemma
You can also evaluate models such as:
ollama pull all-minilm
/api/embed endpoint
Windows PowerShell
(Invoke-WebRequest `
-Method POST `
-Uri http://localhost:11434/api/embed `
-ContentType "application/json" `
-Body '{
"model": "embeddinggemma",
"input": "The cat sat on the mat."
}'
).Content | ConvertFrom-Json
macOS / Linux
curl http://localhost:11434/api/embed \
-H "Content-Type: application/json" \
-d '{
"model": "embeddinggemma",
"input": "The cat sat on the mat."
}'
Embeddings with Python
import ollama
resp = ollama.embed(
model="embeddinggemma",
input="The cat sat on the mat."
)
vector = resp["embeddings"][0]
print(len(vector))
print(vector[:5])
Mini semantic search with Python
import ollama
import numpy as np
texts = [
"The perceptron is a linear classification model.",
"Convolutional networks are useful for images.",
"Transformers use attention mechanisms.",
"Overfitting occurs when a model memorizes the training data too much."
]
resp = ollama.embed(
model="embeddinggemma",
input=texts
)
vectors = np.array(resp["embeddings"])
query = "Which architecture uses attention?"
resp_query = ollama.embed(
model="embeddinggemma",
input=query
)
v_query = np.array(resp_query["embeddings"][0])
similarities = vectors @ v_query / (
np.linalg.norm(vectors, axis=1) * np.linalg.norm(v_query)
)
idx = similarities.argmax()
print("Most relevant text:")
print(texts[idx])
print("Similarity:", similarities[idx])
14. How to know if it is using CPU or GPU
One of the most frequent questions when using Ollama is whether the model is actually using the GPU.
It is not advisable to rely solely on the operating system’s visual graphs. The Windows Task Manager, macOS Activity Monitor, or some graphical tools may show partial or hard-to-interpret information.
The most direct way to check it from Ollama is:
ollama ps
The important column is PROCESSOR.
Conceptual example:
NAME ID SIZE PROCESSOR CONTEXT UNTIL
llama3.2 abc123 2.0 GB 100% GPU 4096 4 minutes from now
The CONTEXT column indicates the effective context loaded for that model. We will look at it in more detail in the next section, dedicated to context length.
| Value | Meaning |
|---|---|
| 100% GPU | The model is fully loaded on GPU. |
| 100% CPU | The model is loaded into system memory and runs on CPU. |
| 48%/52% CPU/GPU | Part of the model is on CPU/RAM and the other part is on GPU/VRAM. |
If the model appears split between CPU and GPU, it can work, but it will normally be slower than if it fully fits in VRAM.
Why there can be CPU usage even when it says 100% GPU
Even if ollama ps shows 100% GPU, it is normal to see some CPU usage. The CPU still participates in tasks such as:
- tokenization;
- execution coordination;
- input and output;
- sampling;
- communication with the local server.
For that reason, do not interpret “CPU usage” as automatically meaning “it is not using the GPU”.
Windows: be careful with Task Manager
On Windows, Task Manager can be misleading. The 3D graph does not always represent CUDA compute or the actual model usage.
For NVIDIA, it is better to open another terminal and run:
nvidia-smi
Also run:
ollama ps
while the model is responding.
If ollama ps shows 100% GPU, the model is fully loaded on GPU even if the Task Manager 3D graph does not seem to move.
Linux with NVIDIA
nvidia-smi
ollama ps
nvidia-smi lets you see VRAM used, active processes, and GPU load.
Linux with AMD
Verification depends on the AMD/ROCm installation and the distribution. Even so, ollama ps is still the first practical diagnostic to see whether Ollama loaded the model on CPU, GPU, or both.
macOS
On Apple Silicon, Ollama can use Metal acceleration. For a practical diagnostic:
ollama ps
You can also review general consumption with Activity Monitor, but the most useful interpretation for Ollama is usually the PROCESSOR column.
Objective metric: tokens per second
To compare actual performance, looking at CPU/GPU graphs is not enough. The most useful metric is the generation speed.
A simple way to see it from the CLI is to run the model in verbose mode:
ollama run llama3.2 --verbose
The --verbose flag works the same on Windows, macOS, and Linux. It can be used from PowerShell, Terminal, or a Bash shell.
When the response finishes, Ollama displays metrics such as:
prompt eval rate: 1200.00 tokens/s
eval rate: 45.00 tokens/s
The eval rate metric indicates approximately how many tokens per second the model generates. The higher, the faster the generation.
From the API, the final responses also include metrics such as eval_count and eval_duration. The approximate speed can be calculated like this:
tokens_per_second = eval_count / eval_duration * 1_000_000_000
because eval_duration is expressed in nanoseconds.
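For example, with the Python client the same calculation can be applied to the final response, assuming it exposes the same eval_count and eval_duration fields as the REST API:
import ollama

resp = ollama.generate(
    model="llama3.2",
    prompt="Explain overfitting in two sentences.",
    stream=False
)
# eval_duration is reported in nanoseconds
tokens_per_second = resp["eval_count"] / resp["eval_duration"] * 1_000_000_000
print(f"Approximate generation speed: {tokens_per_second:.1f} tokens/s")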
Quick diagnosis when a model is slow
- Run ollama ps.
- Check whether PROCESSOR says 100% GPU, 100% CPU, or a mix.
- If there is a CPU/GPU mix, try a smaller model or a smaller context.
- Measure with --verbose.
- Compare eval rate between models.
- Verify VRAM with nvidia-smi if using NVIDIA.
- Avoid assuming that Task Manager reflects all of the model's compute activity.
15. Context length
The context length is the maximum number of tokens that the model can consider in a query.
This matters greatly in:
- RAG;
- analysis of long documents;
- agents;
- code assistants;
- long conversations.
Maximum context vs. effective context
We must distinguish two concepts:
| Concept | Meaning |
|---|---|
| Maximum model context | What the model could support based on its architecture or configuration. |
| Effective context | What Ollama actually uses in a specific execution. |
The effective context is controlled with num_ctx in the API or with equivalent configuration. If not specified, Ollama applies a default value that may be smaller than the maximum context supported by the model.
This often surprises people coming from cloud APIs, where it is usually assumed that the full available context is being used automatically.
View context and offloading
Run:
ollama ps
In recent versions, the output may include a CONTEXT column, in addition to PROCESSOR.
Conceptual example:
NAME ID SIZE PROCESSOR CONTEXT UNTIL
gemma3:latest a2af6cc3eb7f 6.6 GB 100% GPU 65536 2 minutes from now
This lets you review two things at once:
- how much context is being used;
- whether the model loaded fully on GPU or ended up split between CPU/GPU.
Configure context in the API
Example use in /api/chat:
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "Summarize this long text..."
}
],
"stream": false,
"options": {
"num_ctx": 8192
}
}'
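The same option can be passed from the Python client through the options dictionary; a minimal sketch, assuming the installed ollama package:
import ollama

response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize this long text..."}],
    options={"num_ctx": 8192}  # effective context for this request
)
print(response["message"]["content"])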
More context is not always better
Increasing the context can let you work with longer texts, but it can also:
- consume more RAM or VRAM;
- cause part of the model to be offloaded to CPU;
- reduce tokens per second;
- increase response time.
In local models, it is best to measure. For many simple tasks, a smaller context can be faster and sufficient.
Global configuration
It can also be configured via an environment variable:
OLLAMA_CONTEXT_LENGTH=8192
After changing it, restart Ollama.
16. OpenAI API compatibility
Ollama also offers compatibility with OpenAI-style endpoints.
This lets you reuse part of the existing code by changing the base URL.
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama"
)
response = client.chat.completions.create(
model="llama3.2",
messages=[
{
"role": "user",
"content": "Hi, explain what Ollama is."
}
]
)
print(response.choices[0].message.content)
The API key can be any string when using the local server, because local Ollama does not validate an API key as a cloud provider would.
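The compatibility layer is not limited to chat completions. For example, embeddings can also be requested through the OpenAI-style client, assuming the local /v1 endpoint of your Ollama version exposes the embeddings route:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

emb = client.embeddings.create(
    model="embeddinggemma",
    input="The cat sat on the mat."
)
print(len(emb.data[0].embedding))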
17. Best practices
1. Do not download models without criteria
Each model can take up several GB. Before downloading, review size, purpose, and requirements.
ollama show llama3.2
2. Use explicit tags when reproducibility matters
In real projects, avoid relying on ambiguous aliases.
Better:
ollama run llama3.1:8b
Instead of always relying on a generic alias.
3. Separate generation model and embeddings model
In RAG systems, it is common to use:
- a fast embeddings model;
- a higher-quality chat/generation model.
Example:
Embeddings: embeddinggemma
Generation: llama3.2, gemma3, qwen3, or another chat model
4. Do not expose Ollama directly to the internet
If you set:
OLLAMA_HOST=0.0.0.0:11434
Ollama can accept connections from other machines.
This can be useful on a local network, but it must not be exposed directly to the internet without authentication, firewall, and a secure reverse proxy.
5. Measure performance
To compare models:
- measure load time;
- measure tokens per second;
- measure RAM/VRAM consumption;
- check whether it uses CPU or GPU with ollama ps.
6. Mind CORS
If you consume Ollama from a local web application, it may be necessary to configure OLLAMA_ORIGINS.
In development you can use something broad, but in production it must be restricted.
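For example, to allow only a front end served from a specific local origin (the port below is just an illustration):
OLLAMA_ORIGINS=http://localhost:3000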
18. Common errors by operating system
18.1 Windows
ollama is not recognized as a command
Close the terminal and open a new one.
If it still fails, verify that Ollama is installed and that the user PATH has been updated.
curl does not work as expected
In classic PowerShell, use:
curl.exe http://localhost:11434
Or use:
Invoke-WebRequest http://localhost:11434
Strange characters in the progress bar
This may happen with old fonts on Windows 10. Change the terminal font, for example to Cascadia Code, or use Windows Terminal.
Low disk space on C:
Configure:
OLLAMA_MODELS=D:\OllamaModels
Then restart Ollama.
It does not use the GPU
Check:
ollama ps
Also verify NVIDIA/AMD drivers and GPU usage in Task Manager.
18.2 macOS
The CLI does not appear
Open Ollama.app and accept the creation of the CLI link if requested.
You can also verify:
which ollama
Environment variables do not apply
Use launchctl setenv and restart Ollama.app.
Example:
launchctl setenv OLLAMA_HOST "127.0.0.1:11434"
Low performance on Intel Macs
On Intel/x86 Macs, expect CPU execution. For better local performance, Apple Silicon M-series is usually more suitable.
View logs
cat ~/.ollama/logs/server.log
18.3 Linux
The service does not start
sudo systemctl status ollama
View logs
journalctl -e -u ollama
Or:
journalctl -u ollama --no-pager --follow --pager-end
I changed variables and they do not apply
After editing the service:
sudo systemctl daemon-reload
sudo systemctl restart ollama
NVIDIA GPU not detected
Verify:
nvidia-smi
If it does not work, review the NVIDIA driver installation.
AMD GPU not detected
Review AMD/ROCm drivers depending on the distribution and hardware.
19. Proposed exercises
Exercise 1 — Installation and verification
Install Ollama on your operating system and submit:
ollama --version
ollama list
Then run:
curl http://localhost:11434
On Windows you can use:
curl.exe http://localhost:11434
Exercise 2 — First model
Download and run:
ollama run llama3.2
Prompt:
Explain supervised learning in one sentence and then give a practical example.
Save the response.
Exercise 3 — Model comparison
Compare two models, for example:
ollama run llama3.2
ollama run gemma3
Use the same prompt:
Explain Bayes' theorem with a medical example.
Compare:
- response quality;
- perceived speed;
- RAM/VRAM consumption;
- CPU/GPU usage with ollama ps.
Exercise 4 — REST API
Call /api/chat from the corresponding operating system.
Windows
(Invoke-WebRequest `
-Method POST `
-Uri http://localhost:11434/api/chat `
-ContentType "application/json" `
-Body '{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "Give me 5 ideas for simple projects with local AI."
}
],
"stream": false
}'
).Content | ConvertFrom-Json
macOS / Linux
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "Give me 5 ideas for simple projects with local AI."
}
],
"stream": false
}'
Exercise 5 — Python client
Create a script that:
- Reads questions from a questions.txt file.
- Sends each question to the model using ollama.chat.
- Saves the responses in a JSON file with a timestamp.
Base example:
import json
from datetime import datetime
import ollama
with open("questions.txt", "r", encoding="utf-8") as f:
questions = [line.strip() for line in f if line.strip()]
results = []
for question in questions:
response = ollama.chat(
model="llama3.2",
messages=[
{
"role": "user",
"content": question
}
]
)
results.append({
"question": question,
"response": response["message"]["content"],
"timestamp": datetime.now().isoformat()
})
with open("responses.json", "w", encoding="utf-8") as f:
json.dump(results, f, ensure_ascii=False, indent=2)
Exercise 6 — Complete mini RAG
In the embeddings section we already saw how to find the most similar paragraph. In this exercise you must complete the RAG flow:
- Take 10 paragraphs about artificial intelligence.
- Generate embeddings with embeddinggemma.
- Save the vectors in memory.
- Given a question, find the most similar paragraph.
- Pass that paragraph as context to the chat model.
- Ask the model to answer using only that context.
- If the context is not enough, the model must respond: “I do not have enough information in the context”.
Suggested prompt for generation:
Answer the question using only the provided context.
If the context does not contain enough information, respond:
"I do not have enough information in the context".
Context:
{{context}}
Question:
{{question}}
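As a starting point, here is one possible skeleton that connects the pieces from the embeddings and Python sections. It is a sketch, not the only valid solution; the paragraph list is a placeholder to replace with your own texts.
import ollama
import numpy as np

paragraphs = [
    "Replace this list with your 10 paragraphs about artificial intelligence."
]

# Steps 1-3: generate embeddings and keep the vectors in memory
vectors = np.array(ollama.embed(model="embeddinggemma", input=paragraphs)["embeddings"])

def answer(question: str) -> str:
    # Step 4: find the most similar paragraph with cosine similarity
    q = np.array(ollama.embed(model="embeddinggemma", input=question)["embeddings"][0])
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    context = paragraphs[int(sims.argmax())]
    # Steps 5-7: pass that paragraph as context and constrain the answer
    prompt = (
        "Answer the question using only the provided context.\n"
        "If the context does not contain enough information, respond:\n"
        '"I do not have enough information in the context".\n\n'
        f"Context:\n{context}\n\nQuestion:\n{question}"
    )
    resp = ollama.chat(model="llama3.2", messages=[{"role": "user", "content": prompt}])
    return resp["message"]["content"]

print(answer("Which architecture uses attention mechanisms?"))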
Exercise 7 — Performance analysis
Run the same prompt with two different models and record:
- load time;
- total time;
- CPU/GPU usage;
- perceived quality;
- model size.
Use:
ollama ps
20. Closing
With this tutorial you can now install and use Ollama on Windows, macOS, and Linux, run local models, consume the REST API, integrate it with Python, and generate embeddings for RAG systems.
Ollama is a very useful tool to learn applied AI, experiment with local models, and build private prototypes. It does not always replace the more advanced cloud models, but it does help you better understand the architecture of a modern artificial intelligence solution.
In upcoming modules you can advance toward:
- complete RAG systems;
- integration with vector databases;
- agents;
- internal tools for companies;
- code assistants;
- controlled local deployments.
Recommended official sources
- General Ollama documentation: https://docs.ollama.com
- Ollama API: https://docs.ollama.com/api/introduction
- Windows installation: https://docs.ollama.com/windows
- macOS installation: https://docs.ollama.com/macos
- Linux installation: https://docs.ollama.com/linux
- Embeddings: https://docs.ollama.com/capabilities/embeddings
- /api/embed endpoint: https://docs.ollama.com/api/embed
- Troubleshooting: https://docs.ollama.com/troubleshooting
- Context length: https://docs.ollama.com/context-length
- Usage and performance metrics: https://docs.ollama.com/api/usage
- Model library: https://ollama.com/library