forked from sysown/proxysql
v3.1-vec4: Add NLP search demo and fix data processing issues #6
Merged
4 commits
d94dc03 Add StackExchange posts processing script with JSON storage (renecannao)
d37d291 Implement comprehensive StackExchange posts processing with search ca… (renecannao)
ecfff09 Add NLP search demo script with comprehensive search capabilities (renecannao)
62cbd6c Fix issues identified in AI code review (renecannao)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,79 +1,236 @@ | ||
| ### Scripts description | ||
| # StackExchange Posts Processor | ||
|
|
||
| This is a set of example scripts to show the capabilities of the RESTAPI interface and how to interface with it. | ||
| A comprehensive script to extract, process, and index StackExchange posts for search capabilities. | ||
|
|
||
| ### Prepare ProxySQL | ||
| ## Features | ||
|
|
||
| 1. Launch ProxySQL: | ||
| - ✅ **Complete Pipeline**: Extracts parent posts and replies from source database | ||
| - 📊 **Search Ready**: Creates full-text search indexes and processed text columns | ||
| - 🚀 **Efficient**: Batch processing with memory optimization | ||
| - 🔍 **Duplicate Prevention**: Skip already processed posts | ||
| - 📈 **Progress Tracking**: Real-time statistics and performance metrics | ||
| - 🔧 **Flexible**: Configurable source/target databases | ||
| - 📝 **Rich Output**: Structured JSON with tags and metadata | ||
|
|
||
| ## Database Schema | ||
|
|
||
| The script creates a comprehensive target table with these columns: | ||
|
|
||
| ```sql | ||
| processed_posts ( | ||
| PostId BIGINT PRIMARY KEY, | ||
| JsonData JSON NOT NULL, -- Complete post data | ||
| Embeddings BLOB NULL, -- For future ML embeddings | ||
| SearchText LONGTEXT NULL, -- Combined text for search | ||
| TitleText VARCHAR(1000) NULL, -- Cleaned title | ||
| BodyText LONGTEXT NULL, -- Cleaned body | ||
| RepliesText LONGTEXT NULL, -- Combined replies | ||
| Tags JSON NULL, -- Extracted tags | ||
| CreatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP, | ||
| UpdatedAt TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, | ||
|
|
||
| -- Indexes | ||
| KEY idx_created_at (CreatedAt), | ||
| KEY idx_tags ((CAST(Tags AS CHAR(1000)))), -- JSON tag index | ||
| FULLTEXT INDEX ft_search (SearchText, TitleText, BodyText, RepliesText) | ||
| ) | ||
| ``` | ||
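The schema above separates cleaned title, body, and reply text from a combined `SearchText` column. As a minimal sketch of how those columns might be derived, assuming the source posts contain HTML, the helper below strips tags, unescapes entities, and joins the parts. The function names `clean_text` and `build_search_columns` are hypothetical and are not taken from `stackexchange_posts.py`.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # naive HTML tag matcher, good enough for a sketch
WS_RE = re.compile(r"\s+")


def clean_text(raw):
    """Strip HTML tags, unescape entities, and collapse whitespace."""
    if not raw:
        return ""
    without_tags = TAG_RE.sub(" ", raw)
    return WS_RE.sub(" ", html.unescape(without_tags)).strip()


def build_search_columns(title, body, replies):
    """Return values for TitleText, BodyText, RepliesText and SearchText."""
    title_text = clean_text(title)[:1000]  # TitleText is VARCHAR(1000) in the schema
    body_text = clean_text(body)
    cleaned_replies = (clean_text(reply) for reply in replies)
    replies_text = " ".join(text for text in cleaned_replies if text)
    # SearchText is assumed to be the non-empty parts joined by single spaces.
    parts = (title_text, body_text, replies_text)
    search_text = " ".join(part for part in parts if part)
    return {
        "TitleText": title_text,
        "BodyText": body_text,
        "RepliesText": replies_text,
        "SearchText": search_text,
    }
```

A regex-based tag stripper is used here only to keep the sketch dependency-free; a real HTML parser would handle malformed markup more reliably.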
|
|
||
| ## Usage | ||
|
|
||
| ### Basic Usage | ||
|
|
||
| ```bash | ||
| # Process first 1000 posts | ||
| python3 stackexchange_posts.py --limit 1000 | ||
|
|
||
| # Process with custom batch size | ||
| python3 stackexchange_posts.py --limit 10000 --batch-size 500 | ||
|
|
||
| # Don't skip duplicates (process all posts) | ||
| python3 stackexchange_posts.py --limit 1000 --no-skip-duplicates | ||
| ``` | ||
|
|
||
| ### Advanced Configuration | ||
|
|
||
| ```bash | ||
| # Custom database connections | ||
| python3 stackexchange_posts.py \ | ||
| --source-host 192.168.1.100 \ | ||
| --source-port 3307 \ | ||
| --source-user myuser \ | ||
| --source-password mypass \ | ||
| --source-db my_stackexchange \ | ||
| --target-host 192.168.1.200 \ | ||
| --target-port 3306 \ | ||
| --target-user search_user \ | ||
| --target-password search_pass \ | ||
| --target-db search_db \ | ||
| --limit 50000 \ | ||
| --batch-size 1000 | ||
| ``` | ||
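The flags used in the commands above could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual parser: the host, port, database, and batch-size defaults are copied from the Output Example in this README, while the `root` user, the empty password, and the absence of a default limit are assumptions made only for illustration.

```python
import argparse


def build_parser():
    """Illustrative parser for the flags shown in the README (not the real one)."""
    parser = argparse.ArgumentParser(
        description="StackExchange posts processor (CLI sketch)"
    )
    sides = (("source", "stackexchange"), ("target", "stackexchange_post"))
    for side, default_db in sides:
        parser.add_argument(f"--{side}-host", default="127.0.0.1")
        parser.add_argument(f"--{side}-port", type=int, default=3306)
        parser.add_argument(f"--{side}-user", default="root")   # assumed default
        parser.add_argument(f"--{side}-password", default="")   # assumed default
        parser.add_argument(f"--{side}-db", default=default_db)
    parser.add_argument("--limit", type=int, default=None)      # assumed: no limit
    parser.add_argument("--batch-size", type=int, default=100)
    # Duplicates are skipped unless this flag is given.
    parser.add_argument(
        "--no-skip-duplicates", dest="skip_duplicates", action="store_false"
    )
    return parser
```

With `action="store_false"`, `skip_duplicates` defaults to `True`, which matches the `Skip duplicates: True` line in the sample output.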
|
|
||
| ## Search Examples | ||
|
|
||
| Once processed, you can search the data using: | ||
|
|
||
| ### 1. MySQL Full-Text Search | ||
|
|
||
| ```sql | ||
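| -- Basic search, ranked by relevance | ||
| SELECT PostId, TitleText, | ||
| MATCH(SearchText, TitleText, BodyText, RepliesText) AGAINST('mysql optimization' IN BOOLEAN MODE) AS relevance | ||
| FROM processed_posts | ||
| WHERE MATCH(SearchText, TitleText, BodyText, RepliesText) AGAINST('mysql optimization' IN BOOLEAN MODE) | ||
| ORDER BY relevance DESC; | ||
|
|
||
| -- Boolean search operators | ||
| SELECT PostId, TitleText | ||
| FROM processed_posts | ||
| WHERE MATCH(SearchText, TitleText, BodyText, RepliesText) AGAINST('+database -oracle' IN BOOLEAN MODE); | ||
|
|
||
| -- Proximity search (InnoDB syntax: both words within 5 words of each other) | ||
| SELECT PostId, TitleText | ||
| FROM processed_posts | ||
| WHERE MATCH(SearchText, TitleText, BodyText, RepliesText) AGAINST('"database performance" @5' IN BOOLEAN MODE); | ||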
| ``` | ||
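When the search string comes from user input, it is safer to build the boolean-mode expression programmatically and pass it as a query parameter. The sketch below assumes the `ft_search` index from the schema section, so `MATCH()` lists all four indexed columns; `build_boolean_query` and `FULLTEXT_SQL` are hypothetical names, and the `%s` placeholder follows the mysql-connector-python parameter style.

```python
import re

# Characters with special meaning in MySQL boolean-mode full-text search.
_FORBIDDEN = re.compile(r'[\s+\-<>()~*"@]')

# The column list matches the ft_search index documented in the schema.
FULLTEXT_SQL = (
    "SELECT PostId, TitleText FROM processed_posts "
    "WHERE MATCH(SearchText, TitleText, BodyText, RepliesText) "
    "AGAINST (%s IN BOOLEAN MODE)"
)


def build_boolean_query(required=(), excluded=(), optional=()):
    """Build a boolean-mode expression such as '+database -oracle'."""

    def check(term):
        # Reject empty terms and terms that would inject operators.
        if not term or _FORBIDDEN.search(term):
            raise ValueError(f"invalid search term: {term!r}")
        return term

    parts = [f"+{check(term)}" for term in required]
    parts += [f"-{check(term)}" for term in excluded]
    parts += [check(term) for term in optional]
    return " ".join(parts)
```

The resulting string would be passed as the single parameter, for example `cursor.execute(FULLTEXT_SQL, (build_boolean_query(["database"], ["oracle"]),))`.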
| ./proxysql -M --sqlite3-server --idle-threads -f -c $PROXYSQL_PATH/scripts/datadir/proxysql.cnf -D $PROXYSQL_PATH/scripts/datadir | ||
|
|
||
| ### 2. Tag-based Search | ||
|
|
||
| ```sql | ||
| -- Search by specific tags | ||
| SELECT PostId, TitleText | ||
| FROM processed_posts | ||
| WHERE JSON_CONTAINS(Tags, '"mysql"') AND JSON_CONTAINS(Tags, '"performance"'); | ||
| ``` | ||
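The same tag filter can be generated for an arbitrary list of tags. This is a minimal sketch, assuming `Tags` holds a JSON array of strings as in the query above; `build_tag_query` is a hypothetical helper, `TitleText` is taken from the documented schema, and the `%s` placeholders follow the mysql-connector-python parameter style.

```python
import json


def build_tag_query(tags, operator="AND", limit=20):
    """Return (sql, params) selecting posts that carry the given tags."""
    if not tags:
        raise ValueError("at least one tag is required")
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    # One JSON_CONTAINS clause per tag; values are bound, never interpolated.
    clauses = f" {operator} ".join("JSON_CONTAINS(Tags, %s)" for _ in tags)
    sql = (
        "SELECT PostId, TitleText FROM processed_posts "
        f"WHERE {clauses} LIMIT %s"
    )
    # JSON_CONTAINS expects a JSON document, so each tag is JSON-encoded.
    params = [json.dumps(tag) for tag in tags] + [int(limit)]
    return sql, params
```

Only the operator keyword is interpolated into the SQL text, and it is validated against a fixed whitelist first.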
|
|
||
| ### 3. Filtered Search | ||
|
|
||
| ```sql | ||
| -- Search within date range | ||
| SELECT PostId, TitleText, JSON_UNQUOTE(JSON_EXTRACT(JsonData, '$.CreationDate')) AS CreationDate | ||
| FROM processed_posts | ||
| WHERE MATCH(SearchText, TitleText, BodyText, RepliesText) AGAINST('python' IN BOOLEAN MODE) | ||
| AND JSON_UNQUOTE(JSON_EXTRACT(JsonData, '$.CreationDate')) BETWEEN '2023-01-01' AND '2023-12-31'; | ||
| ``` | ||
|
|
||
| ## Performance Tips | ||
|
|
||
| 1. **Batch Size**: Use larger batches (1000-5000) for better throughput | ||
| 2. **Memory**: Adjust batch size based on available memory | ||
| 3. **Indexes**: The script automatically creates necessary indexes | ||
| 4. **Parallel Processing**: Consider running multiple instances with different offset ranges | ||
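For the parallel-processing tip, the work has to be divided into non-overlapping ranges. The sketch below only computes those ranges; how a range is handed to each instance is left open, because this README does not show an offset option for `stackexchange_posts.py`. The name `split_offset_ranges` is hypothetical.

```python
def split_offset_ranges(total_posts, workers):
    """Split total_posts into contiguous (offset, count) ranges, one per worker."""
    if total_posts < 0 or workers < 1:
        raise ValueError("total_posts must be >= 0 and workers >= 1")
    base, extra = divmod(total_posts, workers)
    ranges = []
    offset = 0
    for index in range(workers):
        # The first `extra` workers take one additional post each.
        count = base + (1 if index < extra else 0)
        if count == 0:
            continue  # more workers than posts: skip empty ranges
        ranges.append((offset, count))
        offset += count
    return ranges
```

Each tuple is an `(offset, count)` pair, so the ranges cover every post exactly once.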
|
|
||
| ## Output Example | ||
|
|
||
| ``` | ||
| 🚀 StackExchange Posts Processor | ||
| ================================================== | ||
| Source: 127.0.0.1:3306/stackexchange | ||
| Target: 127.0.0.1:3306/stackexchange_post | ||
| Limit: 1000 posts | ||
| Batch size: 100 | ||
| Skip duplicates: True | ||
| ================================================== | ||
|
|
||
| ✅ Connected to source and target databases | ||
| ✅ Target table created successfully with all search columns | ||
|
|
||
| 2. Configure ProxySQL: | ||
| 🔄 Processing batch 1 - posts 1 to 100 | ||
| ⏭️ Skipping 23 duplicate posts | ||
| 📝 Processing 77 posts... | ||
| 📊 Batch inserted 77 posts | ||
| ⏱️ Progress: 100/1000 posts (10.0%) | ||
| 📈 Total processed: 77, Inserted: 77, Skipped: 23 | ||
| ⚡ Rate: 12.3 posts/sec | ||
|
|
||
| 🎉 Processing complete! | ||
| 📊 Total batches: 10 | ||
| 📝 Total processed: 800 | ||
| ✅ Total inserted: 800 | ||
| ⏭️ Total skipped: 200 | ||
| ⏱️ Total time: 45.2 seconds | ||
| 🚀 Average rate: 17.7 posts/sec | ||
|
|
||
| ✅ Processing completed successfully! | ||
| ``` | ||
| cd $RESTAPI_EXAMPLES_DIR | ||
| ./proxysql_config.sh | ||
|
|
||
| ## Troubleshooting | ||
|
|
||
| ### Common Issues | ||
|
|
||
| 1. **Table Creation Failed**: Check database permissions | ||
| 2. **Memory Issues**: Reduce batch size | ||
| 3. **Slow Performance**: Optimize MySQL configuration | ||
| 4. **Connection Errors**: Verify database credentials | ||
|
|
||
| ### Maintenance | ||
|
|
||
| ```sql | ||
| -- Check table status | ||
| SHOW TABLE STATUS LIKE 'processed_posts'; | ||
|
|
||
| -- Rebuild full-text index | ||
| ALTER TABLE processed_posts DROP INDEX ft_search, | ||
| ADD FULLTEXT INDEX ft_search (SearchText, TitleText, BodyText, RepliesText); | ||
|
|
||
| -- Count processed posts | ||
| SELECT COUNT(*) FROM processed_posts; | ||
| ``` | ||
|
|
||
| 3. Install requirements | ||
| ## Requirements | ||
|
|
||
| - Python 3.7+ | ||
| - mysql-connector-python | ||
| - MySQL 5.7+ (for JSON and full-text support) | ||
|
|
||
| Install dependencies: | ||
| ```bash | ||
| pip install mysql-connector-python | ||
| ``` | ||
| cd $RESTAPI_EXAMPLES_DIR/requirements | ||
| ./install_requirements.sh | ||
|
|
||
| ## Other Scripts | ||
|
|
||
| The `scripts/` directory also contains other utility scripts: | ||
|
|
||
| - `nlp_search_demo.py` - Demonstrate various search techniques on processed posts: | ||
| - Full-text search with MySQL | ||
| - Boolean search with operators | ||
| - Tag-based JSON queries | ||
| - Combined search approaches | ||
| - Statistics and search analytics | ||
| - Data preparation for future semantic search | ||
|
|
||
| - `add_mysql_user.sh` - Add/replace MySQL users in ProxySQL | ||
| - `change_host_status.sh` - Change host status in ProxySQL | ||
| - `flush_query_cache.sh` - Flush ProxySQL query cache | ||
| - `kill_idle_backend_conns.py` - Kill idle backend connections | ||
| - `proxysql_config.sh` - Configure ProxySQL settings | ||
| - `stats_scrapper.py` - Scrape statistics from ProxySQL | ||
|
|
||
| ## Search Examples | ||
|
|
||
| ### Using the NLP Search Demo | ||
|
|
||
| ```bash | ||
| # Show search statistics | ||
| python3 nlp_search_demo.py --mode stats | ||
|
|
||
| # Full-text search | ||
| python3 nlp_search_demo.py --mode full-text --query "mysql performance optimization" | ||
|
|
||
| # Boolean search with operators | ||
| python3 nlp_search_demo.py --mode boolean --query "+database -oracle" | ||
|
|
||
| # Search by tags | ||
| python3 nlp_search_demo.py --mode tags --tags mysql performance --operator AND | ||
|
|
||
| # Combined search with text and tags | ||
| python3 nlp_search_demo.py --mode combined --query "python optimization" --tags python | ||
|
|
||
| # Prepare data for semantic search | ||
| python3 nlp_search_demo.py --mode similarity --query "machine learning" | ||
| ``` | ||
|
|
||
| ### Query the endpoints | ||
|
|
||
| 1. Flush Query Cache: `curl -i -X GET http://localhost:6070/sync/flush_query_cache` | ||
| 2. Change host status: | ||
| - Assuming local ProxySQL: | ||
| ``` | ||
| curl -i -X POST -d '{ "hostgroup_id": "0", "hostname": "127.0.0.1", "port": 13306, "status": "OFFLINE_HARD" }' http://localhost:6070/sync/change_host_status | ||
| ``` | ||
| - Specifying server: | ||
| ``` | ||
| curl -i -X POST -d '{ "admin_host": "127.0.0.1", "admin_port": "6032", "admin_user": "radmin", "admin_pass": "radmin", "hostgroup_id": "0", "hostname": "127.0.0.1", "port": 13306, "status": "OFFLINE_HARD" }' http://localhost:6070/sync/change_host_status | ||
| ``` | ||
| 2. Add or replace MySQL user: | ||
| - Assuming local ProxySQL: | ||
| ``` | ||
| curl -i -X POST -d '{ "user": "sbtest1", "pass": "sbtest1" }' http://localhost:6070/sync/add_mysql_user | ||
| ``` | ||
| - Add user and load to runtime (Assuming local instance): | ||
| ``` | ||
| curl -i -X POST -d '{ "user": "sbtest1", "pass": "sbtest1", "to_runtime": 1 }' http://localhost:6070/sync/add_mysql_user | ||
| ``` | ||
| - Specifying server: | ||
| ``` | ||
| curl -i -X POST -d '{ "admin_host": "127.0.0.1", "admin_port": "6032", "admin_user": "radmin", "admin_pass": "radmin", "user": "sbtest1", "pass": "sbtest1" }' http://localhost:6070/sync/add_mysql_user | ||
| ``` | ||
| 3. Kill idle backend connections: | ||
| - Assuming local ProxySQL: | ||
| ``` | ||
| curl -i -X POST -d '{ "timeout": 10 }' http://localhost:6070/sync/kill_idle_backend_conns | ||
| ``` | ||
| - Specifying server: | ||
| ``` | ||
| curl -i -X POST -d '{ "admin_host": "127.0.0.1", "admin_port": 6032, "admin_user": "radmin", "admin_pass": "radmin", "timeout": 10 }' http://localhost:6070/sync/kill_idle_backend_conns | ||
| ``` | ||
| 4. Scrap tables from 'stats' schema: | ||
| - Assuming local ProxySQL: | ||
| ``` | ||
| curl -i -X POST -d '{ "table": "stats_mysql_users" }' http://localhost:6070/sync/scrap_stats | ||
| ``` | ||
| - Specifying server: | ||
| ``` | ||
| curl -i -X POST -d '{ "admin_host": "127.0.0.1", "admin_port": 6032, "admin_user": "radmin", "admin_pass": "radmin", "table": "stats_mysql_users" }' http://localhost:6070/sync/scrap_stats | ||
| ``` | ||
| - Provoke script failure (non-existing table): | ||
| ``` | ||
| curl -i -X POST -d '{ "admin_host": "127.0.0.1", "admin_port": 6032, "admin_user": "radmin", "admin_pass": "radmin", "table": "stats_mysql_servers" }' http://localhost:6070/sync/scrap_stats | ||
| ``` | ||
|
|
||
| ### Scripts doc | ||
|
|
||
| - All scripts can perform their target operations on a local or remote ProxySQL instance. | ||
| - Note that the only 'GET' request is the one for 'QUERY CACHE' flushing, since it doesn't require any parameters. | ||
| - Script 'stats_scrapper.py' fails when a table that isn't present in the 'stats' schema is queried. This is left as an example of the behavior of a failing script and of the ProxySQL log output. | ||
| ## License | ||
|
|
||
| Internal use only. | ||
The schema documentation includes `KEY idx_tags ((CAST(Tags AS CHAR(1000))))`. However, the pull request description and the `stackexchange_posts.py` script (line 79) state that this functional index was removed because it was problematic and caused connection drops. Including it in the documentation is misleading for users trying to understand or recreate the schema. It should be removed or commented with a note about why it's disabled.
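One way to act on this comment, shown only as a sketch: keep the documented DDL in a single constant with the functional index commented out and annotated. The column list is copied from the README schema; the constant `CREATE_TABLE_SQL`, the helper `active_sql_lines`, and the `CREATE TABLE IF NOT EXISTS` wrapper are assumptions, not code from `stackexchange_posts.py`.

```python
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS processed_posts (
    PostId BIGINT PRIMARY KEY,
    JsonData JSON NOT NULL,
    Embeddings BLOB NULL,
    SearchText LONGTEXT NULL,
    TitleText VARCHAR(1000) NULL,
    BodyText LONGTEXT NULL,
    RepliesText LONGTEXT NULL,
    Tags JSON NULL,
    CreatedAt TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UpdatedAt TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    KEY idx_created_at (CreatedAt),
    -- KEY idx_tags ((CAST(Tags AS CHAR(1000)))),  disabled: reported to cause connection drops
    FULLTEXT INDEX ft_search (SearchText, TitleText, BodyText, RepliesText)
)
"""


def active_sql_lines(sql):
    """Return the non-empty lines of sql that are not '--' line comments."""
    stripped = (line.strip() for line in sql.splitlines())
    return [line for line in stripped if line and not line.startswith("--")]
```

A check such as `"idx_tags" not in " ".join(active_sql_lines(CREATE_TABLE_SQL))` could then guard against the index being re-enabled by accident.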