timebank-cc-public/references/ELASTICSEARCH_SETUP.md
Ronald Huynen 2547717edb Initial commit
2026-03-23 21:37:59 +01:00

# Elasticsearch Setup Guide
This guide documents the Elasticsearch search engine setup for full-text search, multilingual content indexing, and location-based queries in the application.
## SECURITY
**Elasticsearch is a DATABASE and must NEVER be exposed to the internet without proper security!**
### Default Security Risks
By default, Elasticsearch 7.x ships with:
- **NO authentication** - anyone can read/write/delete all data
- **NO encryption** - all data transmitted in plain text
- **NO access control** - full admin access for anyone who connects
### Required Security Configuration
**For Development (Local Machine Only):**
```yaml
# /etc/elasticsearch/elasticsearch.yml
network.host: 127.0.0.1 # ONLY localhost - NOT 0.0.0.0!
http.port: 9200
xpack.security.enabled: false # OK for localhost-only
```
**For Production/Remote Servers:**
```yaml
# /etc/elasticsearch/elasticsearch.yml
network.host: 127.0.0.1 # ONLY localhost - use reverse proxy if needed
http.port: 9200
xpack.security.enabled: true # REQUIRED for any server accessible remotely
xpack.security.transport.ssl.enabled: true # also needs a node certificate/keystore configured
```
### Verify Your Server Is NOT Exposed
**Check what interface Elasticsearch is listening on:**
```bash
ss -tlnp | grep 9200
```
**SAFE** - Should show ONLY localhost addresses:
```
127.0.0.1:9200 # IPv4 localhost
[::1]:9200 # IPv6 localhost
[::ffff:127.0.0.1]:9200 # IPv6-mapped IPv4 localhost (also safe!)
```
**Note**: The `[::ffff:127.0.0.1]` format is the IPv6 representation of IPv4 localhost - it's still localhost-only and secure.
**DANGER** - If you see any of these, YOU ARE EXPOSED:
```
0.0.0.0:9200 # Listening on ALL interfaces - EXPOSED!
*:9200 # Listening on ALL interfaces - EXPOSED!
YOUR_PUBLIC_IP:9200 # Listening on public IP - EXPOSED!
```
**Test external accessibility:**
```bash
# From another machine or from the internet
curl http://YOUR_SERVER_IP:9200
# Should get: Connection refused (GOOD!)
# If you get a JSON response - YOU ARE EXPOSED TO THE INTERNET!
```
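The listener check above can be scripted. The following is a minimal sketch and not part of the application: `classify_listener` is a hypothetical helper that labels one local address, as printed by `ss -tln`, as loopback-only or exposed. It assumes the default port 9200.

```bash
# Classify one "address:port" string taken from `ss -tln` output.
# Prints "safe" for loopback listeners on 9200, "EXPOSED" for any other
# listener on 9200, and nothing for unrelated ports.
classify_listener() {
  case "$1" in
    127.*:9200|"[::1]:9200"|"[::ffff:127."*"]:9200") echo "safe" ;;
    *:9200) echo "EXPOSED" ;;
    *) ;;
  esac
}

# Usage (column 4 of `ss -tln` is the local address):
#   ss -tln | awk 'NR>1 {print $4}' | while read -r addr; do classify_listener "$addr"; done
```

A single `EXPOSED` line is enough reason to stop the service and review `network.host`.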
### What Happens If Exposed?
If Elasticsearch is exposed to the internet without authentication:
1. Attackers can **read all your data** (users, emails, private information)
2. Attackers can **delete all your indices** (all search data gone)
3. Attackers can **modify data** (corrupt your search results)
4. Attackers can **execute scripts** (potential remote code execution)
**Real-world attacks:**
- Ransomware attacks encrypting Elasticsearch data
- Mass data exfiltration of exposed databases
- Bitcoin mining malware installation
- Complete data deletion with ransom demands
### Immediate Actions If You Discover Exposure
1. **IMMEDIATELY stop Elasticsearch:**
```bash
sudo systemctl stop elasticsearch
```
2. **Fix the configuration:**
```bash
sudo nano /etc/elasticsearch/elasticsearch.yml
# Set: network.host: 127.0.0.1
# Set: xpack.security.enabled: true
```
3. **Start Elasticsearch with the fixed configuration:**
```bash
sudo systemctl start elasticsearch
```
4. **Set passwords for the built-in users (requires a running node with security enabled):**
```bash
sudo /usr/share/elasticsearch/bin/elasticsearch-setup-passwords interactive
```
5. **Verify it's no longer accessible:**
```bash
curl http://YOUR_SERVER_IP:9200
# Should show: Connection refused
```
6. **Review logs for unauthorized access:**
```bash
sudo grep -i "unauthorized\|access denied\|failed\|401\|403" /var/log/elasticsearch/*.log
```
---
## Overview
The application uses **Elasticsearch 7.17.24** with Laravel Scout for:
- Full-text search across Users, Organizations, Banks, and Posts
- Multilingual search with language-specific analyzers (EN, NL, DE, ES, FR)
- Location-based search with edge n-gram tokenization
- Skill and tag matching with boost factors
- Autocomplete suggestions
- Custom search optimization with configurable boost factors
**Scout Driver**: `matchish/laravel-scout-elasticsearch` v7.12.0
**Elasticsearch Client**: `elasticsearch/elasticsearch` v8.19.0
## Prerequisites
- PHP 8.3+ with required extensions
- MySQL/MariaDB database (primary data source)
- Redis server (for Scout queue)
- Java: the Elasticsearch 7.x packages bundle their own JDK; a separate Java Runtime Environment (JRE) 11+ is only needed if you override the bundled one
- At least 4GB RAM available for Elasticsearch (8GB+ recommended for production)
## Installation
### 1. Install Elasticsearch
#### On Ubuntu/Debian:
```bash
# Import the Elasticsearch GPG key
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
# Add the Elasticsearch repository
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-7.x.list
# Update package list and install
sudo apt-get update
sudo apt-get install elasticsearch=7.17.24
# Hold the package to prevent unwanted upgrades
sudo apt-mark hold elasticsearch
```
#### On CentOS/RHEL:
```bash
# Import the GPG key
sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
# Create repository file
cat <<EOF | sudo tee /etc/yum.repos.d/elasticsearch.repo
[elasticsearch-7.x]
name=Elasticsearch repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
EOF
# Install specific version
sudo yum install elasticsearch-7.17.24
```
### 2. Configure Elasticsearch
#### Basic Configuration
Edit `/etc/elasticsearch/elasticsearch.yml`:
```yaml
# Cluster name (single-node setup)
cluster.name: elasticsearch
# Node name
node.name: node-1
# Network settings for local development
network.host: 127.0.0.1
http.port: 9200
# Discovery settings (single-node)
discovery.type: single-node
# Path settings (default, can be customized)
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
# Security (disabled for local development, enable for production)
xpack.security.enabled: false
```
#### Memory Configuration
Configure JVM heap size in `/etc/elasticsearch/jvm.options.d/heap.options`:
```
# Development: 2-4GB
-Xms2g
-Xmx2g
# Production: 8-16GB (at most 50% of system RAM, and below ~31GB)
# -Xms16g
# -Xmx16g
```
**Important Memory Guidelines:**
- Set `-Xms` and `-Xmx` to the same value
- Keep the heap at or below 50% of total system RAM; the remainder serves the OS file cache that Elasticsearch relies on
- Keep the heap below roughly 31GB so the JVM can still use compressed object pointers (compressed oops)
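As a worked example of these guidelines, the sketch below computes a heap size from total RAM. `suggest_heap_mb` is a hypothetical helper, and the 31GB cap (31744 MB) is an assumed safety margin under the compressed oops limit, not a value taken from the application.

```bash
# Suggest a JVM heap size in MB: half of total RAM, capped at 31 GB.
# $1 = total system RAM in MB.
suggest_heap_mb() {
  local total_mb=$1
  local half=$(( total_mb / 2 ))
  local cap=31744   # 31 GB, assumed margin below the 32 GB limit
  if [ "$half" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$half"
  fi
}

# Usage (Linux, reads total RAM from `free`):
#   suggest_heap_mb "$(free -m | awk '/^Mem:/ {print $2}')"
```

For an 8GB machine this yields 4096, which corresponds to `-Xms4g` and `-Xmx4g`.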
#### System Limits
The systemd service already configures these limits:
```
LimitNOFILE=65535
LimitNPROC=4096
LimitAS=infinity
```
If you run Elasticsearch outside systemd, set the same limits in `/etc/security/limits.conf`:
```
elasticsearch soft nofile 65535
elasticsearch hard nofile 65535
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096
```
### 3. Start and Enable Elasticsearch
```bash
# Start Elasticsearch
sudo systemctl start elasticsearch
# Enable to start on boot
sudo systemctl enable elasticsearch
# Check status
sudo systemctl status elasticsearch
# View logs
sudo journalctl -u elasticsearch -f
```
### 4. Verify Installation
```bash
# Test connection
curl http://localhost:9200
# Expected output:
# {
# "name" : "node-1",
# "cluster_name" : "elasticsearch",
# "version" : {
# "number" : "7.17.24",
# ...
# },
# "tagline" : "You Know, for Search"
# }
# Check cluster health
curl "http://localhost:9200/_cluster/health?pretty"
# Check available indices
curl "http://localhost:9200/_cat/indices?v"
```
## Laravel Application Configuration
### 1. Environment Variables
Configure Elasticsearch connection in `.env`:
```env
# Search configuration
SCOUT_DRIVER=matchish-elasticsearch
SCOUT_QUEUE=true
SCOUT_PREFIX=
# Elasticsearch connection
ELASTICSEARCH_HOST=localhost:9200
# ELASTICSEARCH_USER=elastic # Uncomment for production with auth
# ELASTICSEARCH_PASSWORD=your_password # Uncomment for production with auth
# Queue for background indexing (recommended)
QUEUE_CONNECTION=redis
```
### 2. Configuration Files
The application has extensive Elasticsearch configuration:
**`config/scout.php`**
- Driver: `matchish-elasticsearch`
- Queue enabled for async indexing
- Chunk size: 500 records per batch
- Soft deletes: Not kept in search index
**`config/elasticsearch.php`**
- Index mappings for all searchable models (825 lines!)
- Language-specific analyzers (NL, EN, FR, DE, ES)
- Custom analyzers for names and locations
- Date format handling
- Field boost configuration
**`config/timebank-cc.php`** (search section)
- Boost factors for fields and models
- Search behavior (type, fragment size, highlighting)
- Maximum results and caching
- Model indices to search
- Suggestion count
### 3. Searchable Models
The following models use Scout's `Searchable` trait:
- **User** → `users_index`
- **Organization** → `organizations_index`
- **Bank** → `banks_index`
- **Post** → `posts_index`
- **Transaction** → `transactions_index`
- **Tag** → `tags_index`
Each model defines:
- `searchableAs()`: Index name
- `toSearchableArray()`: Data structure for indexing
## Index Management
### Creating Indices
Indices are automatically created when you import data:
```bash
# Import all models (creates indices with timestamps)
php artisan scout:import "App\Models\User"
php artisan scout:import "App\Models\Organization"
php artisan scout:import "App\Models\Bank"
php artisan scout:import "App\Models\Post"
# Queue-based import (recommended for large datasets)
php artisan scout:queue-import "App\Models\User"
```
**Index Naming**: Indices are created with a timestamp suffix (e.g., `users_index_1758826582`), and an alias without the suffix (e.g., `users_index`) provides the stable name that searches use.
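To illustrate the naming scheme, the sketch below derives the alias from a timestamped index name and prints a request body for Elasticsearch's `_aliases` API. Both function names are hypothetical, and this is not how the Scout driver itself manages aliases. The request only adds the alias; it does not remove it from an older index.

```bash
# Strip the trailing _<timestamp> to get the stable alias name.
alias_for() {
  printf '%s\n' "${1%_[0-9]*}"
}

# Print an _aliases request body that points the alias at the given index.
alias_actions_json() {
  printf '{"actions":[{"add":{"index":"%s","alias":"%s"}}]}\n' "$1" "$(alias_for "$1")"
}

# Usage:
#   alias_actions_json users_index_1758826582 |
#     curl -s -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d @-
```

Elasticsearch rejects an alias whose name matches an existing index, so a plain index called `users_index` has to be removed before the alias can be created.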
### Reindexing Script
The application includes a comprehensive reindexing script at `re-index-search.sh`:
```bash
# Run the reindexing script
./re-index-search.sh
```
**What it does:**
1. Cleans up old indices and removes conflicts
2. Waits for cluster health
3. Imports all models (Users, Organizations, Banks, Posts)
4. Creates stable aliases pointing to latest timestamped indices
5. Shows final index and alias status
**Important**: The script uses `SCOUT_QUEUE=false` to force immediate indexing, bypassing the queue for reliable completion.
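The "waits for cluster health" step can be approximated as follows. This is a sketch under assumptions, not the contents of `re-index-search.sh`: `health_status` and `wait_for_cluster` are hypothetical names, and the two-second polling interval is arbitrary.

```bash
# Extract the "status" value (green, yellow or red) from
# _cluster/health JSON read on stdin.
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Poll until the cluster reports yellow or green, or give up
# after $1 attempts (default 30).
wait_for_cluster() {
  local attempts=${1:-30} status
  while [ "$attempts" -gt 0 ]; do
    status=$(curl -s "http://localhost:9200/_cluster/health" | health_status)
    case "$status" in
      green|yellow) echo "cluster is $status"; return 0 ;;
    esac
    attempts=$(( attempts - 1 ))
    sleep 2
  done
  echo "cluster not ready" >&2
  return 1
}
```

A typical call would be `wait_for_cluster 30 && php artisan scout:import "App\Models\User"`.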
### Manual Index Operations
```bash
# Flush: remove all of a model's records from its index
php artisan scout:flush "App\Models\User"
# Delete a specific index
php artisan scout:delete-index users_index_1758826582
# Delete all indices
php artisan scout:delete-all-indexes
# Create a new index
php artisan scout:index users_index
# Check indices via curl
curl "http://localhost:9200/_cat/indices?v"
# Check aliases
curl "http://localhost:9200/_cat/aliases?v"
```
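Because every import creates a new timestamped index, older ones can pile up. The sketch below lists all but the newest index for a given base name. `older_indices` is a hypothetical helper, and it assumes the timestamps have equal length so that a plain sort orders them chronologically.

```bash
# Read index names on stdin and print every "<base>_<timestamp>" index
# except the newest one. $1 = base name, e.g. users_index.
older_indices() {
  grep -E "^${1}_[0-9]+\$" | LC_ALL=C sort | sed '$d'
}

# Usage: review the list first, then delete each entry with
# `php artisan scout:delete-index <name>`:
#   curl -s "http://localhost:9200/_cat/indices?h=index" | older_indices users_index
```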
## Search Features
### Multilingual Search
The configuration supports 5 languages with dedicated analyzers:
**Language Analyzers:**
- `analyzer_nl`: Dutch (stop words + stemming)
- `analyzer_en`: English (stop words + stemming)
- `analyzer_fr`: French (stop words + stemming)
- `analyzer_de`: German (stop words + stemming)
- `analyzer_es`: Spanish (stop words + stemming)
**Special Analyzers:**
- `name_analyzer`: For profile names with edge n-grams (autocomplete)
- `locations_analyzer`: For cities/districts with custom stop words
- `analyzer_general`: Generic tokenization for general text
### Boost Configuration
Field boost factors (configured in `config/timebank-cc.php`):
**Profile Fields:**
```php
'name' => 1,
'full_name' => 1,
'cyclos_skills' => 1.5,
'tags' => 2, // Highest boost
'tag_categories' => 1.4,
'motivation' => 1,
'about_short' => 1,
'about' => 1,
```
**Post Fields:**
```php
'title' => 2, // Highest boost
'excerpt' => 1.5,
'content' => 1,
'post_category_name' => 2, // High boost
```
**Model Boost (score multipliers):**
```php
'user' => 1, // Baseline
'organization' => 3, // 3x boost
'bank' => 3, // 3x boost
'post' => 4, // 4x boost (highest)
```
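As a rough illustration of the model boosts, assume they act as plain multipliers on the relevance score, as the heading suggests. How they combine with field and location boosts is decided in application code that is not shown here. `boosted_score` is a hypothetical helper for the arithmetic only.

```bash
# Multiply a base relevance score by a model boost factor.
# $1 = base score, $2 = boost.
boosted_score() {
  awk -v score="$1" -v boost="$2" 'BEGIN { printf "%.2f\n", score * boost }'
}

# Usage:
#   boosted_score 2.5 4   # a post matching with base score 2.5
#   boosted_score 2.5 1   # a user matching with the same base score
```

Under this assumption a post needs only a quarter of a user's base score to rank equally.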
### Location-Based Search
The application has advanced location boost factors:
```php
'same_district' => 5.0, // Highest boost
'same_city' => 3.0, // High boost
'same_division' => 2.0, // Medium boost
'same_country' => 1.5, // Base boost
'different_country' => 1.0, // Neutral
'no_location' => 0.9, // Slight penalty
```
### Search Highlighting
Search results include highlighted matches:
```php
'fragment_size' => 80, // Characters per fragment
'number_of_fragments' => 2, // Max fragments
'pre-tags' => '<span class="font-semibold text-white leading-tight">',
'post-tags' => '</span>',
```
### Caching
Search results are cached for performance:
```php
'cache_results' => 5, // TTL in minutes
```
## Index Structure Examples
### Users Index Mapping
```json
{
  "users_index": {
    "properties": {
      "id": { "type": "keyword" },
      "name": {
        "type": "text",
        "analyzer": "name_analyzer",
        "fields": {
          "keyword": { "type": "keyword" },
          "suggest": { "type": "completion" }
        }
      },
      "about_nl": { "type": "text", "analyzer": "analyzer_nl" },
      "about_en": { "type": "text", "analyzer": "analyzer_en" },
      "about_fr": { "type": "text", "analyzer": "analyzer_fr" },
      "about_de": { "type": "text", "analyzer": "analyzer_de" },
      "about_es": { "type": "text", "analyzer": "analyzer_es" },
      "locations": {
        "properties": {
          "district": { "type": "text", "analyzer": "locations_analyzer" },
          "city": { "type": "text", "analyzer": "locations_analyzer" },
          "division": { "type": "text", "analyzer": "locations_analyzer" },
          "country": { "type": "text", "analyzer": "locations_analyzer" }
        }
      },
      "tags": {
        "properties": {
          "contexts": {
            "properties": {
              "tags": {
                "properties": {
                  "name_nl": { "type": "text", "analyzer": "analyzer_nl" },
                  "name_en": { "type": "text", "analyzer": "analyzer_en" }
                  // ... other languages
                }
              }
            }
          }
        }
      }
    }
  }
}
```
### Posts Index Mapping
```json
{
  "posts_index": {
    "properties": {
      "id": { "type": "keyword" },
      "category_id": { "type": "integer" },
      "status": { "type": "keyword" },
      "featured": { "type": "boolean" },
      "post_translations": {
        "properties": {
          "title_nl": {
            "type": "text",
            "analyzer": "analyzer_nl",
            "fields": {
              "keyword": { "type": "keyword" },
              "suggest": { "type": "completion" }
            }
          },
          "title_en": {
            "type": "text",
            "analyzer": "analyzer_en",
            "fields": {
              "keyword": { "type": "keyword" },
              "suggest": { "type": "completion" }
            }
          },
          "content_nl": { "type": "text", "analyzer": "analyzer_nl" },
          "content_en": { "type": "text", "analyzer": "analyzer_en" },
          "from_nl": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||strict_date_optional_time||epoch_millis"
          }
          // ... other languages and fields
        }
      }
    }
  }
}
```
## Troubleshooting
### Elasticsearch Won't Start
**Problem**: Service fails to start
**Solutions**:
1. Check memory settings:
```bash
# View JVM settings
cat /etc/elasticsearch/jvm.options.d/heap.options
# Check available system memory
free -h
# Ensure heap size doesn't exceed 50% of RAM
```
2. Check disk space:
```bash
df -h /var/lib/elasticsearch
```
3. Check logs:
```bash
sudo journalctl -u elasticsearch -n 100 --no-pager
sudo tail -f /var/log/elasticsearch/elasticsearch.log
```
4. Check Java installation:
```bash
java -version
```
### Connection Refused
**Problem**: Cannot connect to Elasticsearch
**Solutions**:
1. Verify Elasticsearch is running:
```bash
sudo systemctl status elasticsearch
```
2. Check port binding:
```bash
ss -tlnp | grep 9200
```
3. Check configuration:
```bash
sudo grep -E "^network.host|^http.port" /etc/elasticsearch/elasticsearch.yml
```
4. Test connection:
```bash
curl http://localhost:9200
```
### Index Not Found
**Problem**: `index_not_found_exception` when searching
**Solutions**:
1. Check if indices exist:
```bash
curl "http://localhost:9200/_cat/indices?v"
```
2. Check if aliases exist:
```bash
curl "http://localhost:9200/_cat/aliases?v"
```
3. Reimport the model:
```bash
php artisan scout:import "App\Models\User"
```
4. Or run the full reindex script:
```bash
./re-index-search.sh
```
### Slow Indexing / High Memory Usage
**Problem**: Indexing takes too long or uses excessive memory
**Solutions**:
1. Enable queue for async indexing in `.env`:
```env
SCOUT_QUEUE=true
QUEUE_CONNECTION=redis
```
2. Start queue worker:
```bash
php artisan queue:work --queue=high,default
```
3. Reduce chunk size in `config/scout.php`:
```php
'chunk' => [
'searchable' => 250, // Reduced from 500
],
```
4. Monitor Elasticsearch memory:
```bash
curl "http://localhost:9200/_nodes/stats/jvm?pretty"
```
### Search Results Are Incorrect
**Problem**: Search doesn't return expected results
**Solutions**:
1. Check index mapping:
```bash
curl "http://localhost:9200/users_index/_mapping?pretty"
```
2. Test query directly:
```bash
curl -X GET "localhost:9200/users_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match": {
"name": "test"
}
}
}
'
```
3. Clear and rebuild index:
```bash
php artisan scout:flush "App\Models\User"
php artisan scout:import "App\Models\User"
```
4. Check Scout queue jobs:
```bash
php artisan queue:failed
php artisan queue:retry all
```
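When the simple `match` query in step 2 finds nothing, it can help to query the language-specific fields directly. The sketch below builds a `multi_match` body over `name` and the `about_*` fields from the users mapping excerpt. `about_query` is a hypothetical helper, it does not escape quotes in the search term, and the query the application actually sends may differ.

```bash
# Print a multi_match query body that searches the given term in the
# name field and the per-language "about" fields. $1 = search term.
about_query() {
  printf '{"query":{"multi_match":{"query":"%s","fields":["name","about_nl","about_en","about_fr","about_de","about_es"]}}}\n' "$1"
}

# Usage:
#   about_query gardening |
#     curl -s -X GET "localhost:9200/users_index/_search?pretty" \
#       -H 'Content-Type: application/json' -d @-
```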
### Out of Memory Errors
**Problem**: `OutOfMemoryError` in Elasticsearch logs
**Solutions**:
1. Increase JVM heap (but respect limits):
```bash
# Edit /etc/elasticsearch/jvm.options.d/heap.options
-Xms4g
-Xmx4g
```
2. Restart Elasticsearch:
```bash
sudo systemctl restart elasticsearch
```
3. Monitor memory usage:
```bash
watch -n 1 "curl -s 'http://localhost:9200/_cat/nodes?v&h=heap.percent,ram.percent'"
```
4. Clear fielddata cache:
```bash
curl -X POST "localhost:9200/_cache/clear?fielddata=true"
```
### Shards Unassigned
**Problem**: Yellow or red cluster health
**Solutions**:
1. Check cluster health:
```bash
curl "http://localhost:9200/_cluster/health?pretty"
```
2. Check shard allocation:
```bash
curl "http://localhost:9200/_cat/shards?v"
```
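3. On a single-node setup, yellow health usually means replica shards have no second node to be assigned to; set replicas to 0 on all indices (this does not fix red health, which indicates an unassigned primary shard):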
```bash
curl -X PUT "localhost:9200/_settings" -H 'Content-Type: application/json' -d'
{
"index": {
"number_of_replicas": 0
}
}
'
```
## Production Recommendations
### Security
1. **Enable X-Pack Security**:
Edit `/etc/elasticsearch/elasticsearch.yml`:
```yaml
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
```
2. **Set passwords** (restart Elasticsearch first; the tool needs a running node with security enabled):
```bash
sudo /usr/share/elasticsearch/bin/elasticsearch-setup-passwords auto
```
3. **Update `.env`**:
```env
ELASTICSEARCH_USER=elastic
ELASTICSEARCH_PASSWORD=generated_password
```
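Once security is enabled, the unauthenticated `curl` commands elsewhere in this guide are rejected with HTTP 401. The sketch below wraps `curl` so it reuses the `.env` variable names. `es_url` and `es` are hypothetical helpers, the variables are assumed to be exported in the shell, and passing credentials with `-u` exposes them to other local users through the process list.

```bash
# Build the request URL from ELASTICSEARCH_HOST (default localhost:9200).
# $1 = API path, with or without a leading slash.
es_url() {
  printf 'http://%s/%s\n' "${ELASTICSEARCH_HOST:-localhost:9200}" "${1#/}"
}

# Call curl against the given path, adding basic auth when
# ELASTICSEARCH_USER is set. Extra arguments are passed on to curl.
es() {
  local path=$1
  shift
  if [ -n "${ELASTICSEARCH_USER:-}" ]; then
    curl -s -u "${ELASTICSEARCH_USER}:${ELASTICSEARCH_PASSWORD:-}" "$@" "$(es_url "$path")"
  else
    curl -s "$@" "$(es_url "$path")"
  fi
}

# Usage:
#   export ELASTICSEARCH_USER=elastic ELASTICSEARCH_PASSWORD=generated_password
#   es "_cluster/health?pretty"
```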
### Performance Optimization
1. **Increase file descriptors**:
```bash
# /etc/security/limits.conf
elasticsearch soft nofile 65535
elasticsearch hard nofile 65535
```
2. **Disable swapping**:
```yaml
# /etc/elasticsearch/elasticsearch.yml
bootstrap.memory_lock: true
```
Edit `/etc/systemd/system/elasticsearch.service.d/override.conf`, then run `sudo systemctl daemon-reload` and restart Elasticsearch:
```ini
[Service]
LimitMEMLOCK=infinity
```
3. **Use SSD for data directory**:
```yaml
# /etc/elasticsearch/elasticsearch.yml
path.data: /mnt/ssd/elasticsearch
```
4. **Set appropriate refresh interval**:
```bash
curl -X PUT "localhost:9200/users_index/_settings" -H 'Content-Type: application/json' -d'
{
"index": {
"refresh_interval": "30s"
}
}
'
```
### Backup and Restore
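1. **Configure snapshot repository** (the `location` must first be listed under `path.repo` in `elasticsearch.yml`, followed by a restart):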
```bash
curl -X PUT "localhost:9200/_snapshot/backup_repo" -H 'Content-Type: application/json' -d'
{
"type": "fs",
"settings": {
"location": "/var/backups/elasticsearch",
"compress": true
}
}
'
```
2. **Create snapshot**:
```bash
curl -X PUT "localhost:9200/_snapshot/backup_repo/snapshot_1?wait_for_completion=true"
```
3. **Restore snapshot** (open indices with the same names must be closed or deleted first):
```bash
curl -X POST "localhost:9200/_snapshot/backup_repo/snapshot_1/_restore"
```
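Snapshot names must be unique within a repository, so a fixed name like `snapshot_1` only works once. A minimal sketch for timestamped names follows; `snapshot_name` is a hypothetical helper and the naming pattern is an assumption, not an application convention.

```bash
# Print a unique, lowercase snapshot name based on the current UTC time,
# e.g. snapshot_20260323_203759.
snapshot_name() {
  date -u +snapshot_%Y%m%d_%H%M%S
}

# Usage:
#   curl -s -X PUT "localhost:9200/_snapshot/backup_repo/$(snapshot_name)?wait_for_completion=true"
```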
### Monitoring
1. **Check cluster stats**:
```bash
curl "http://localhost:9200/_cluster/stats?pretty"
```
2. **Monitor node stats**:
```bash
curl "http://localhost:9200/_nodes/stats?pretty"
```
3. **Check index stats**:
```bash
curl "http://localhost:9200/_stats?pretty"
```
4. **Set up monitoring with Kibana** (optional):
```bash
sudo apt-get install kibana=7.17.24
sudo systemctl enable kibana
sudo systemctl start kibana
```
## Quick Reference
### Essential Commands
```bash
# Service management
sudo systemctl start elasticsearch
sudo systemctl stop elasticsearch
sudo systemctl restart elasticsearch
sudo systemctl status elasticsearch
# Check health
curl "http://localhost:9200"
curl "http://localhost:9200/_cluster/health?pretty"
curl "http://localhost:9200/_cat/indices?v"
# Laravel Scout commands
php artisan scout:import "App\Models\User"
php artisan scout:flush "App\Models\User"
php artisan scout:delete-all-indexes
# Reindex everything
./re-index-search.sh
# Queue worker for async indexing
php artisan queue:work --queue=high,default
```
### Configuration Files
- `.env` - Connection and driver configuration
- `config/scout.php` - Laravel Scout settings
- `config/elasticsearch.php` - Index mappings and analyzers (825 lines!)
- `config/timebank-cc.php` - Search boost factors and behavior
- `/etc/elasticsearch/elasticsearch.yml` - Elasticsearch server config
- `/etc/elasticsearch/jvm.options.d/heap.options` - JVM memory settings
- `/usr/lib/systemd/system/elasticsearch.service` - systemd service
### Important Paths
- **Data**: `/var/lib/elasticsearch`
- **Logs**: `/var/log/elasticsearch`
- **Config**: `/etc/elasticsearch`
- **Binary**: `/usr/share/elasticsearch`
## Additional Resources
- **Elasticsearch Documentation**: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/
- **Laravel Scout**: https://laravel.com/docs/10.x/scout
- **Matchish Scout Elasticsearch**: https://github.com/matchish/laravel-scout-elasticsearch
- **Elasticsearch DSL**: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl.html
- **Language Analyzers**: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/analysis-lang-analyzer.html
## Notes
- This application uses a multilingual search setup with custom analyzers
- The `config/elasticsearch.php` file is extensive (825 lines) with detailed field mappings
- Location-based search uses edge n-grams for autocomplete functionality
- Tags and categories have hierarchical support with multilingual translations
- The reindexing script handles index versioning and aliasing automatically
- Memory requirements are significant during indexing (plan accordingly)