295 changes: 295 additions & 0 deletions doc/posts-embeddings-setup.md
@@ -0,0 +1,295 @@
# Posts Table Embeddings Setup Guide

This guide explains how to set up and populate virtual tables for storing and searching embeddings of the Posts table content, using the sqlite-rembed and sqlite-vec extensions in ProxySQL.

## Prerequisites

1. **ProxySQL** running with SQLite3 backend enabled (`--sqlite3-server` flag)
2. **Posts table** copied from MySQL to SQLite3 server (248,905 rows)
   - Use `scripts/copy_stackexchange_Posts_mysql_to_sqlite3.py` if not already copied
3. **Valid API credentials** for embedding generation
4. **Network access** to embedding API endpoint

## Setup Steps

### Step 1: Create Virtual Vector Table

Create a virtual table for storing 768-dimensional embeddings (matching nomic-embed-text-v1.5 model output):

```sql
-- Create virtual vector table for Posts embeddings
CREATE VIRTUAL TABLE Posts_embeddings USING vec0(
    embedding float[768]
);
```

### Step 2: Configure API Client

Configure an embedding API client using the `temp.rembed_clients` virtual table:

```sql
-- Configure embedding API client
-- Replace YOUR_API_KEY with actual API key
INSERT INTO temp.rembed_clients(name, options) VALUES
  ('posts-embed-client',
   rembed_client_options(
     'format', 'openai',
     'url', 'https://api.synthetic.new/openai/v1/embeddings',
     'key', 'YOUR_API_KEY',
     'model', 'hf:nomic-ai/nomic-embed-text-v1.5'
   )
  );
```
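
To keep the API key out of saved SQL files, the statement above could be generated by a small shell helper. This is a minimal sketch rather than part of the guide: the function names `sql_quote` and `rembed_client_sql` and the environment variable `EMBED_API_KEY` are hypothetical, while the client name, URL, and model are the ones used above.

```bash
#!/bin/bash
# Hypothetical helper: double single quotes so a value is safe inside a
# SQL string literal.
sql_quote() {
    printf '%s' "$1" | sed "s/'/''/g"
}

# Hypothetical helper: print the INSERT statement for temp.rembed_clients.
# Arguments: client name, endpoint URL, API key, model name.
rembed_client_sql() {
    local name url key model
    name=$(sql_quote "$1")
    url=$(sql_quote "$2")
    key=$(sql_quote "$3")
    model=$(sql_quote "$4")
    cat <<EOF
INSERT INTO temp.rembed_clients(name, options) VALUES
  ('$name',
   rembed_client_options(
     'format', 'openai',
     'url', '$url',
     'key', '$key',
     'model', '$model'
   )
  );
EOF
}

# EMBED_API_KEY is an assumed environment variable; the placeholder from the
# guide is used when it is not set.
rembed_client_sql "posts-embed-client" \
    "https://api.synthetic.new/openai/v1/embeddings" \
    "${EMBED_API_KEY:-YOUR_API_KEY}" \
    "hf:nomic-ai/nomic-embed-text-v1.5"
```

The printed statement can be piped to the `mysql` client connected to the ProxySQL SQLite3 server.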

### Step 3: Generate and Insert Embeddings

#### For Testing (First 100 rows)

```sql
-- Generate embeddings for the first 100 Posts (ordered by rowid so the
-- selection is deterministic)
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT rowid, rembed('posts-embed-client',
       COALESCE(Title || ' ', '') || Body) as embedding
FROM Posts
ORDER BY rowid
LIMIT 100;
```

#### For Full Table (Batch Processing)

Use this optimized batch query that processes unembedded rows without requiring rowid tracking:

```sql
-- Batch process unembedded rows (processes up to 1000 rows per run)
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
       COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL
LIMIT 1000;
```

**Key features of this batch query:**
- Uses `LEFT JOIN` to find Posts without existing embeddings
- `WHERE Posts_embeddings.rowid IS NULL` filters for unprocessed rows
- `LIMIT 1000` controls batch size
- Can be run repeatedly until all rows are processed
- No need to track which rowids have been processed
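
The "run repeatedly until done" behaviour can be sketched as a small loop with a stall guard. This is an illustrative sketch only: `drain_batches`, `fake_count`, and `fake_batch` are hypothetical names, and the two stub commands stand in for the real `mysql` calls so that the example runs without a server. In practice the count command would return the number of rows matched by the `LEFT JOIN ... IS NULL` filter and the batch command would run the batch `INSERT`.

```bash
#!/bin/bash
# drain_batches COUNT_CMD BATCH_CMD [MAX_BATCHES]
# Runs BATCH_CMD until COUNT_CMD reports 0 remaining rows, a batch makes no
# progress, or MAX_BATCHES is reached. Prints the number of batches run.
drain_batches() {
    local count_cmd=$1 batch_cmd=$2 max_batches=${3:-1000}
    local batches=0 remaining previous
    remaining=$("$count_cmd")
    while [ "$remaining" -gt 0 ] && [ "$batches" -lt "$max_batches" ]; do
        "$batch_cmd" >/dev/null
        batches=$((batches + 1))
        previous=$remaining
        remaining=$("$count_cmd")
        # Stop if the batch embedded nothing (for example, the API is failing).
        if [ "$remaining" -ge "$previous" ]; then
            break
        fi
    done
    echo "$batches"
}

# Stubs standing in for the mysql calls (hypothetical, for illustration):
# 2500 unembedded rows, each batch embeds up to 1000 of them.
FAKE_REMAINING=2500
fake_count() { echo "$FAKE_REMAINING"; }
fake_batch() {
    if [ "$FAKE_REMAINING" -gt 1000 ]; then
        FAKE_REMAINING=$((FAKE_REMAINING - 1000))
    else
        FAKE_REMAINING=0
    fi
}

echo "Batches run: $(drain_batches fake_count fake_batch)"
```

The stall guard is a design choice of this sketch: without it, a batch that inserts nothing would keep the loop running indefinitely.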

### Step 4: Verify Embeddings

```sql
-- Check total embeddings count
SELECT COUNT(*) as total_embeddings FROM Posts_embeddings;

-- Check embedding size (should be 3072 bytes: 768 dimensions × 4 bytes)
SELECT rowid, length(embedding) as embedding_size_bytes
FROM Posts_embeddings LIMIT 3;

-- Check percentage of Posts with embeddings
SELECT
    (SELECT COUNT(*) FROM Posts_embeddings) as with_embeddings,
    (SELECT COUNT(*) FROM Posts) as total_posts,
    ROUND(
        (SELECT COUNT(*) FROM Posts_embeddings) * 100.0 /
        (SELECT COUNT(*) FROM Posts), 2
    ) as percentage_complete;
```
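
When the two counts are fetched into shell variables instead, the same percentage can be computed with integer arithmetic. The sketch below is an assumption-laden illustration: `percent_complete` is a hypothetical helper, it truncates to two decimals where SQL `ROUND` would round, and it returns `0.00` for an empty table instead of dividing by zero.

```bash
#!/bin/bash
# percent_complete PROCESSED TOTAL
# Prints PROCESSED/TOTAL as a percentage with two decimals (truncated).
percent_complete() {
    local processed=$1 total=$2
    if [ "$total" -le 0 ]; then
        # Avoid division by zero when the Posts table is empty.
        echo "0.00"
        return
    fi
    # Work in hundredths of a percent to stay in integer arithmetic.
    local hundredths=$(( processed * 10000 / total ))
    printf '%d.%02d\n' $(( hundredths / 100 )) $(( hundredths % 100 ))
}

# Example: 1000 embedded rows out of the 248,905 rows in Posts.
echo "Complete: $(percent_complete 1000 248905)%"
```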

## Batch Processing Strategy for 248,905 Rows

### Recommended Approach

1. **Run the batch query repeatedly** until all rows have embeddings
2. **Add delays between batches** to avoid API rate limiting
3. **Monitor progress** using the verification queries above
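
A rough sense of the scale involved can be worked out up front. The sketch below uses the row count, batch size, and delay from this guide; the helper names `batches_needed` and `min_delay_seconds` are hypothetical, and the estimate covers only the pauses between batches, not the time the embedding API takes.

```bash
#!/bin/bash
# batches_needed TOTAL_ROWS BATCH_SIZE
# Ceiling of TOTAL_ROWS / BATCH_SIZE.
batches_needed() {
    local total_rows=$1 batch_size=$2
    echo $(( (total_rows + batch_size - 1) / batch_size ))
}

# min_delay_seconds BATCHES DELAY
# Total sleep time, assuming no pause is needed after the final batch.
min_delay_seconds() {
    local batches=$1 delay=$2
    echo $(( (batches - 1) * delay ))
}

batches=$(batches_needed 248905 1000)
echo "Batches: $batches"
echo "Minimum sleep time: $(min_delay_seconds "$batches" 5) seconds"
```

With 248,905 rows, a batch size of 1000, and a 5-second delay, this gives 249 batches and 1240 seconds of pauses.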

### Example Shell Script for Batch Processing

```bash
#!/bin/bash
# process_posts_embeddings.sh

PROXYSQL_HOST="127.0.0.1"
PROXYSQL_PORT="6030"
MYSQL_USER="root"
MYSQL_PASS="root"
BATCH_SIZE=1000
DELAY_SECONDS=5

echo "Starting Posts embeddings generation..."

while true; do
    # Execute batch query: embed up to $BATCH_SIZE rows that have no embedding yet
    mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" << EOF
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
       COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL
LIMIT $BATCH_SIZE;
EOF

    # Check how many rows have been processed (-s -N prints the bare number)
    PROCESSED=$(mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" -s -N -e "SELECT COUNT(*) FROM Posts_embeddings;")

    TOTAL=$(mysql -h "$PROXYSQL_HOST" -P "$PROXYSQL_PORT" -u "$MYSQL_USER" -p"$MYSQL_PASS" -s -N -e "SELECT COUNT(*) FROM Posts;")

    PERCENTAGE=$(echo "scale=2; $PROCESSED * 100 / $TOTAL" | bc)
    echo "Processed: $PROCESSED/$TOTAL rows ($PERCENTAGE%)"

    # Break if all rows processed
    if [ "$PROCESSED" -ge "$TOTAL" ]; then
        echo "All rows processed!"
        break
    fi

    # Wait before next batch
    echo "Waiting $DELAY_SECONDS seconds before next batch..."
    sleep "$DELAY_SECONDS"
done
```

**Review comment (medium), on the first `mysql` call:** Passing the password on the command line with `-p"$MYSQL_PASS"` can be a security risk, as the password may be visible in the system's process list. For better security, consider using a MySQL configuration file (`~/.my.cnf`) or the `MYSQL_PWD` environment variable. Using `MYSQL_PWD` would be consistent with how `API_KEY` is handled for the remote API. This applies to all `mysql` calls in this script.

**Review comment (high), on lines +139 to +141:** The query to get the `TOTAL` count of posts is incorrect. It counts all rows in the Posts table, but the embedding process only considers a subset of posts (specifically, `PostTypeId IN (1,2)` and text length > 30). This will cause the script to get stuck in an infinite loop because the `PROCESSED` count will never reach the `TOTAL` count. The query for `TOTAL` should apply the same filters as the processing script. Suggested change: extend the `TOTAL` query to `SELECT COUNT(*) FROM Posts WHERE Posts.PostTypeId IN (1,2) AND LENGTH(COALESCE(Posts.Title || ' ', '') || Posts.Body) > 30;`.

## Similarity Search Examples

Once embeddings are generated, you can perform semantic search:

### Example 1: Find Similar Posts

```sql
-- Find Posts similar to a query about databases
SELECT p.SiteId, p.Id as PostId, p.Title, e.distance,
       substr(p.Body, 1, 100) as body_preview
FROM (
    SELECT rowid, distance
    FROM Posts_embeddings
    WHERE embedding MATCH rembed('posts-embed-client',
                                 'database systems and SQL queries')
    LIMIT 5
) e
JOIN Posts p ON e.rowid = p.rowid
ORDER BY e.distance;
```

### Example 2: Find Posts Similar to Specific Post

```sql
-- Find Posts similar to the Post with rowid 1
SELECT p2.SiteId, p2.Id as PostId, p2.Title, e.distance,
       substr(p2.Body, 1, 100) as body_preview
FROM (
    SELECT rowid, distance
    FROM Posts_embeddings
    WHERE embedding MATCH (
        SELECT embedding
        FROM Posts_embeddings
        WHERE rowid = 1 -- Change to target Post rowid
    )
    AND rowid != 1 -- Exclude the target Post itself
    LIMIT 5
) e
JOIN Posts p2 ON e.rowid = p2.rowid
ORDER BY e.distance;
```

## Performance Considerations

1. **API Rate Limiting**: The `rembed()` function makes HTTP requests to the API
   - Batch size of 1000 with 5-second delays is conservative
   - Adjust based on API rate limits
   - Monitor API usage and costs

2. **Embedding Storage**:
   - Each embedding: 768 dimensions × 4 bytes = 3,072 bytes
   - Full table (248,905 rows): ~765 MB of raw vector data
   - Ensure sufficient disk space

3. **Search Performance**:
   - `vec0` virtual tables currently perform exhaustive (brute-force) nearest neighbor search rather than approximate search
   - Query time grows with the number of vectors and dimensions
   - Use `LIMIT` clauses to control result size
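
The storage figures above follow from simple arithmetic, sketched here. The helper names `embedding_bytes`, `storage_bytes`, and `to_mb` are hypothetical, and the result covers raw float32 vector data only; any per-table overhead of `vec0` is not included.

```bash
#!/bin/bash
# embedding_bytes DIMENSIONS
# Size of one float32 embedding: 4 bytes per dimension.
embedding_bytes() {
    echo $(( $1 * 4 ))
}

# storage_bytes ROWS DIMENSIONS
# Raw vector storage for ROWS embeddings.
storage_bytes() {
    echo $(( $1 * $2 * 4 ))
}

# to_mb BYTES
# Convert bytes to decimal megabytes, rounded to the nearest MB.
to_mb() {
    echo $(( ($1 + 500000) / 1000000 ))
}

per_row=$(embedding_bytes 768)
total=$(storage_bytes 248905 768)
echo "Per embedding: $per_row bytes"
echo "Full table: $total bytes (~$(to_mb "$total") MB)"
```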

## Troubleshooting

### Common Issues

1. **API Connection Errors**
   - Verify API key is valid and has quota
   - Check network connectivity to API endpoint
   - Confirm API endpoint URL is correct

2. **Embedding Generation Failures**
   - Check `temp.rembed_clients` configuration
   - Verify client name matches in `rembed()` calls
   - Test with simple text first: `SELECT rembed('posts-embed-client', 'test');`

3. **Batch Processing Stalls**
   - Check if API rate limits are being hit
   - Increase delay between batches
   - Reduce batch size

4. **Memory Issues**
   - Large batches may consume significant memory
   - Reduce batch size if encountering memory errors
   - Monitor ProxySQL memory usage
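
One way to "increase delay between batches" automatically is to double the delay after each stalled batch, up to a cap. This is a sketch of that idea, not something the guide prescribes: the helper `next_delay` and the 60-second cap are assumptions chosen for illustration, starting from the guide's 5-second delay.

```bash
#!/bin/bash
# next_delay CURRENT MAX
# Doubles the delay in seconds, capped at MAX.
next_delay() {
    local current=$1 max=$2
    local doubled=$(( current * 2 ))
    if [ "$doubled" -gt "$max" ]; then
        echo "$max"
    else
        echo "$doubled"
    fi
}

# Show how the delay would grow over five consecutive stalled batches.
delay=5
for attempt in 1 2 3 4 5; do
    echo "attempt $attempt: wait ${delay}s"
    delay=$(next_delay "$delay" 60)
done
```

After a batch succeeds, the delay would typically be reset to its starting value.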

### Verification Queries

```sql
-- Check API client configuration
SELECT name, json_extract(options, '$.format') as format,
       json_extract(options, '$.model') as model
FROM temp.rembed_clients;

-- Test embedding generation
SELECT length(rembed('posts-embed-client', 'test text')) as test_embedding_size;

-- Check for embedding generation errors
SELECT rowid FROM Posts_embeddings WHERE length(embedding) != 3072;
```

## Maintenance

### Adding New Posts

When new Posts are added to the table:

```sql
-- Generate embeddings for new Posts
INSERT OR REPLACE INTO Posts_embeddings(rowid, embedding)
SELECT Posts.rowid, rembed('posts-embed-client',
       COALESCE(Posts.Title || ' ', '') || Posts.Body) as embedding
FROM Posts
LEFT JOIN Posts_embeddings ON Posts.rowid = Posts_embeddings.rowid
WHERE Posts_embeddings.rowid IS NULL;
```

### Recreating Virtual Table

If you need to recreate the virtual table:

```sql
-- Drop existing table
DROP TABLE IF EXISTS Posts_embeddings;

-- Recreate with same schema
CREATE VIRTUAL TABLE Posts_embeddings USING vec0(
    embedding float[768]
);
```

## Related Resources

1. [sqlite-rembed Integration Documentation](./sqlite-rembed-integration.md)
2. [SQLite3 Server Documentation](./SQLite3-Server.md)
3. [Vector Search Testing](../doc/vector-search-test/README.md)
4. [Copy Script](../scripts/copy_stackexchange_Posts_mysql_to_sqlite3.py)

---

*Last Updated: $(date)*