timebank-cc-public/references/ELASTICSEARCH_SETUP.md
Ronald Huynen 2547717edb Initial commit
2026-03-23 21:37:59 +01:00

Elasticsearch Setup Guide

This guide documents the Elasticsearch search engine setup for full-text search, multilingual content indexing, and location-based queries in the application.

SECURITY

Elasticsearch is a DATABASE and must NEVER be exposed to the internet without proper security!

Default Security Risks

By default, Elasticsearch 7.x ships with:

  • NO authentication - anyone can read/write/delete all data
  • NO encryption - all data transmitted in plain text
  • NO access control - full admin access for anyone who connects

Required Security Configuration

For Development (Local Machine Only):

# /etc/elasticsearch/elasticsearch.yml
network.host: 127.0.0.1        # ONLY localhost - NOT 0.0.0.0!
http.port: 9200
xpack.security.enabled: false  # OK for localhost-only

For Production/Remote Servers:

# /etc/elasticsearch/elasticsearch.yml
network.host: 127.0.0.1        # ONLY localhost - use reverse proxy if needed
http.port: 9200
xpack.security.enabled: true   # REQUIRED for any server accessible remotely
xpack.security.transport.ssl.enabled: true

Verify Your Server Is NOT Exposed

Check what interface Elasticsearch is listening on:

ss -tlnp | grep 9200

SAFE - Should show ONLY localhost addresses:

127.0.0.1:9200              # IPv4 localhost
[::1]:9200                  # IPv6 localhost
[::ffff:127.0.0.1]:9200     # IPv4-mapped IPv6 localhost (also safe!)

Note: The [::ffff:127.0.0.1] format is the IPv4-mapped IPv6 form of 127.0.0.1 - it's still localhost-only and secure.

DANGER - If you see any of these, YOU ARE EXPOSED:

0.0.0.0:9200          # Listening on ALL interfaces - EXPOSED!
*:9200                # Listening on ALL interfaces - EXPOSED!
YOUR_PUBLIC_IP:9200   # Listening on public IP - EXPOSED!
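
The manual check above can be scripted. The following is a minimal sketch and not part of the application: the function names classify_listen_addr and check_es_binding are made up for illustration, and it assumes the usual `ss -tln` column layout in which the local address is the fourth field.

```shell
# Classify one "Local Address:Port" value from `ss -tln` for port 9200.
# Prints: safe (loopback only), exposed (reachable from other hosts),
# or unrelated (not port 9200).
classify_listen_addr() {
  case "$1" in
    127.*:9200 | '[::1]:9200' | '[::ffff:127.'*']:9200') echo safe ;;
    *:9200) echo exposed ;;
    *) echo unrelated ;;
  esac
}

# Read `ss -tln` output on stdin, report every socket on port 9200,
# and return non-zero if any of them is exposed.
check_es_binding() {
  local exposed=0 addr verdict
  while read -r _ _ _ addr _; do
    verdict=$(classify_listen_addr "$addr")
    if [ "$verdict" = exposed ]; then
      echo "EXPOSED: $addr"
      exposed=1
    elif [ "$verdict" = safe ]; then
      echo "ok: $addr"
    fi
  done
  return "$exposed"
}
```

A possible invocation is `ss -tln | check_es_binding || echo "fix network.host before continuing"`.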

Test external accessibility:

# From another machine or from the internet
curl http://YOUR_SERVER_IP:9200

# Should get: Connection refused, or a timeout if a firewall drops the packets (GOOD!)
# If you get a JSON response - YOU ARE EXPOSED TO THE INTERNET!
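
When the external test is run from a script, curl's exit status can be turned into a verdict. This is a small sketch under stated assumptions: interpret_exposure_probe is a hypothetical helper name, and it relies on curl's documented exit codes (0 for a completed transfer, 7 for a failed connection, 28 for a timeout).

```shell
# Map a curl exit status ($1) to a verdict for the external exposure test.
#   0  = a response was received, so the port is reachable
#   7  = failed to connect (for example: connection refused)
#   28 = timed out (for example: a firewall silently drops the packets)
interpret_exposure_probe() {
  case "$1" in
    0) echo "EXPOSED: port 9200 answered" ;;
    7 | 28) echo "OK: port 9200 not reachable" ;;
    *) echo "UNKNOWN: curl exit status $1" ;;
  esac
}

# Usage from another machine (YOUR_SERVER_IP is a placeholder):
#   curl -s -o /dev/null --max-time 5 "http://YOUR_SERVER_IP:9200"
#   interpret_exposure_probe $?
```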

What Happens If Exposed?

If Elasticsearch is exposed to the internet without authentication:

  1. Attackers can read all your data (users, emails, private information)
  2. Attackers can delete all your indices (all search data gone)
  3. Attackers can modify data (corrupt your search results)
  4. Attackers can execute scripts (potential remote code execution)

Real-world attacks:

  • Ransomware attacks encrypting Elasticsearch data
  • Mass data exfiltration of exposed databases
  • Bitcoin mining malware installation
  • Complete data deletion with ransom demands

Immediate Actions If You Discover Exposure

  1. IMMEDIATELY stop Elasticsearch:
sudo systemctl stop elasticsearch
  2. Fix the configuration:
sudo nano /etc/elasticsearch/elasticsearch.yml
# Set: network.host: 127.0.0.1
# Set: xpack.security.enabled: true
  3. Restart with the fixed configuration:
sudo systemctl start elasticsearch
  4. Set passwords for the built-in users (the node must be running, because the tool connects to it):
sudo /usr/share/elasticsearch/bin/elasticsearch-setup-passwords interactive
  5. Verify it's no longer accessible:
curl http://YOUR_SERVER_IP:9200
# Should show: Connection refused
  6. Review logs for unauthorized access:
sudo grep -i "unauthorized\|access denied\|failed\|401\|403" /var/log/elasticsearch/*.log

Overview

The application uses Elasticsearch 7.17.24 with Laravel Scout for:

  • Full-text search across Users, Organizations, Banks, and Posts
  • Multilingual search with language-specific analyzers (EN, NL, DE, ES, FR)
  • Location-based search with edge n-gram tokenization
  • Skill and tag matching with boost factors
  • Autocomplete suggestions
  • Custom search optimization with configurable boost factors

Scout Driver: matchish/laravel-scout-elasticsearch v7.12.0
Elasticsearch Client: elasticsearch/elasticsearch v8.19.0

Prerequisites

  • PHP 8.3+ with required extensions
  • MySQL/MariaDB database (primary data source)
  • Redis server (for Scout queue)
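  • Java: the official Elasticsearch 7.x packages bundle their own JDK; a separate Java Runtime Environment (JRE) 11+ is only needed if you configure Elasticsearch to use a system Java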
  • At least 4GB RAM available for Elasticsearch (8GB+ recommended for production)

Installation

1. Install Elasticsearch

On Ubuntu/Debian:

# Import the Elasticsearch GPG key
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg

# Add the Elasticsearch repository
echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee /etc/apt/sources.list.d/elastic-7.x.list

# Update package list and install
sudo apt-get update
sudo apt-get install elasticsearch=7.17.24

# Hold the package to prevent unwanted upgrades
sudo apt-mark hold elasticsearch

On CentOS/RHEL:

# Import the GPG key
sudo rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch

# Create repository file
cat <<EOF | sudo tee /etc/yum.repos.d/elasticsearch.repo
[elasticsearch-7.x]
name=Elasticsearch repository for 7.x packages
baseurl=https://artifacts.elastic.co/packages/7.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md
EOF

# Install specific version
sudo yum install elasticsearch-7.17.24

2. Configure Elasticsearch

Basic Configuration

Edit /etc/elasticsearch/elasticsearch.yml:

# Cluster name (single-node setup)
cluster.name: elasticsearch

# Node name
node.name: node-1

# Network settings for local development
network.host: 127.0.0.1
http.port: 9200

# Discovery settings (single-node)
discovery.type: single-node

# Path settings (default, can be customized)
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

# Security (disabled for local development, enable for production)
xpack.security.enabled: false

Memory Configuration

Configure JVM heap size in /etc/elasticsearch/jvm.options.d/heap.options:

# Development: 2-4GB
-Xms2g
-Xmx2g

# Production: 8-16GB (50% of system RAM, max 32GB)
# -Xms16g
# -Xmx16g

Important Memory Guidelines:

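  • Set -Xms and -Xmx to the same value
  • Never exceed 50% of total system RAM; the other half is needed for the OS file cache that Elasticsearch relies on
  • Stay below 32GB (above that the JVM can no longer use compressed object pointers, so the extra heap is largely wasted)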
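
The guidelines above reduce to simple arithmetic. The sketch below is illustrative only: recommend_heap_mb and heap_options are hypothetical helper names, and the 31GB cap is an assumption chosen to stay safely under the 32GB limit.

```shell
# Print a heap size in MB for a machine with $1 MB of RAM:
# half of the RAM, capped at 31 GB (31744 MB, an assumed safety margin).
recommend_heap_mb() {
  local half=$(( $1 / 2 )) cap=31744
  if [ "$half" -gt "$cap" ]; then
    echo "$cap"
  else
    echo "$half"
  fi
}

# Print the two lines for /etc/elasticsearch/jvm.options.d/heap.options,
# using the same value for -Xms and -Xmx.
heap_options() {
  local mb
  mb=$(recommend_heap_mb "$1")
  printf '%s\n%s\n' "-Xms${mb}m" "-Xmx${mb}m"
}
```

For example, `heap_options "$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)"` prints settings for the current machine. Review the result before writing it to the options file, since other services on the same host also need memory.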

System Limits

The systemd service already configures these limits:

LimitNOFILE=65535
LimitNPROC=4096
LimitAS=infinity

If running manually, also set in /etc/security/limits.conf:

elasticsearch soft nofile 65535
elasticsearch hard nofile 65535
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096
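
To confirm that a limit is in effect for the current shell, its value can be compared with the required minimum. A minimal sketch follows; limit_at_least is a hypothetical helper, not an existing tool.

```shell
# Succeed if a `ulimit` value ($1) satisfies a required minimum ($2).
# `ulimit` prints "unlimited" when no limit applies, which always passes.
limit_at_least() {
  if [ "$1" = unlimited ]; then
    return 0
  fi
  [ "$1" -ge "$2" ]
}

# Usage, run as the user that starts Elasticsearch:
#   limit_at_least "$(ulimit -n)" 65535 || echo "nofile limit too low"
```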

3. Start and Enable Elasticsearch

# Start Elasticsearch
sudo systemctl start elasticsearch

# Enable to start on boot
sudo systemctl enable elasticsearch

# Check status
sudo systemctl status elasticsearch

# View logs
sudo journalctl -u elasticsearch -f

4. Verify Installation

# Test connection
curl http://localhost:9200

# Expected output:
# {
#   "name" : "node-1",
#   "cluster_name" : "elasticsearch",
#   "version" : {
#     "number" : "7.17.24",
#     ...
#   },
#   "tagline" : "You Know, for Search"
# }

# Check cluster health
curl http://localhost:9200/_cluster/health?pretty

# Check available indices
curl http://localhost:9200/_cat/indices?v
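
The version check can be automated by extracting the "number" field from the root endpoint's response. This is a sketch using sed rather than a JSON parser, so it assumes the response layout shown above; es_version and es_version_is are hypothetical helper names.

```shell
# Extract the version number from the root endpoint's JSON on stdin.
es_version() {
  sed -n 's/.*"number" *: *"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed only if the version read from stdin equals the expected one ($1).
es_version_is() {
  [ "$(es_version)" = "$1" ]
}

# Usage:
#   curl -s http://localhost:9200 | es_version_is 7.17.24 && echo "version OK"
```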

Laravel Application Configuration

1. Environment Variables

Configure Elasticsearch connection in .env:

# Search configuration
SCOUT_DRIVER=matchish-elasticsearch
SCOUT_QUEUE=true
SCOUT_PREFIX=

# Elasticsearch connection
ELASTICSEARCH_HOST=localhost:9200
# ELASTICSEARCH_USER=elastic            # Uncomment for production with auth
# ELASTICSEARCH_PASSWORD=your_password  # Uncomment for production with auth

# Queue for background indexing (recommended)
QUEUE_CONNECTION=redis
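
A quick sanity check can catch a missing setting before indexing fails. The sketch below assumes the four keys listed above are the ones that must be non-empty (SCOUT_PREFIX is intentionally left out because it is empty in the example); check_scout_env is a hypothetical helper name.

```shell
# Report every expected key that is absent or empty in the given .env file.
# Returns non-zero if at least one key is missing.
check_scout_env() {
  local env_file=$1 missing=0 key
  for key in SCOUT_DRIVER SCOUT_QUEUE ELASTICSEARCH_HOST QUEUE_CONNECTION; do
    if ! grep -q "^${key}=." "$env_file"; then
      echo "missing or empty: $key"
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_scout_env .env && echo "search settings present"
```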

2. Configuration Files

The application has extensive Elasticsearch configuration:

config/scout.php

  • Driver: matchish-elasticsearch
  • Queue enabled for async indexing
  • Chunk size: 500 records per batch
  • Soft deletes: Not kept in search index

config/elasticsearch.php

  • Index mappings for all searchable models (825 lines!)
  • Language-specific analyzers (NL, EN, FR, DE, ES)
  • Custom analyzers for names and locations
  • Date format handling
  • Field boost configuration

config/timebank-cc.php (search section)

  • Boost factors for fields and models
  • Search behavior (type, fragment size, highlighting)
  • Maximum results and caching
  • Model indices to search
  • Suggestion count

3. Searchable Models

The following models use Scout's Searchable trait:

  • User → users_index
  • Organization → organizations_index
  • Bank → banks_index
  • Post → posts_index
  • Transaction → transactions_index
  • Tag → tags_index

Each model defines:

  • searchableAs(): Index name
  • toSearchableArray(): Data structure for indexing

Index Management

Creating Indices

Indices are automatically created when you import data:

# Import all models (creates indices with timestamps)
php artisan scout:import "App\Models\User"
php artisan scout:import "App\Models\Organization"
php artisan scout:import "App\Models\Bank"
php artisan scout:import "App\Models\Post"

# Queue-based import (recommended for large datasets)
php artisan scout:queue-import "App\Models\User"

Index Naming: Indices are created with timestamps (e.g., users_index_1758826582) and aliases are used for stable names.
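
To illustrate the naming scheme, the sketch below picks the newest timestamped index for a base name and builds a request body for Elasticsearch's `_aliases` API. It is not taken from the application: latest_index and alias_add_json are hypothetical helpers, and using the base name itself as the alias is an assumption.

```shell
# Read index names (one per line) on stdin and print the one named
# "<base>_<timestamp>" with the highest numeric timestamp. $1 is the base name.
latest_index() {
  awk -v prefix="${1}_" '
    index($0, prefix) == 1 {
      ts = substr($0, length(prefix) + 1)
      if (ts ~ /^[0-9]+$/ && ts + 0 > best) { best = ts + 0; name = $0 }
    }
    END { if (name != "") print name }
  '
}

# Print an _aliases request body that adds alias $2 to index $1.
alias_add_json() {
  printf '{"actions":[{"add":{"index":"%s","alias":"%s"}}]}\n' "$1" "$2"
}

# Usage:
#   idx=$(curl -s 'http://localhost:9200/_cat/indices?h=index' | latest_index users_index)
#   curl -X POST 'http://localhost:9200/_aliases' \
#        -H 'Content-Type: application/json' -d "$(alias_add_json "$idx" users_index)"
```

This only adds an alias. If the alias still points at an older index, a matching "remove" action would be needed in the same request.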

Reindexing Script

The application includes a comprehensive reindexing script at re-index-search.sh:

# Run the reindexing script
./re-index-search.sh

What it does:

  1. Cleans up old indices and removes conflicts
  2. Waits for cluster health
  3. Imports all models (Users, Organizations, Banks, Posts)
  4. Creates stable aliases pointing to latest timestamped indices
  5. Shows final index and alias status

Important: The script uses SCOUT_QUEUE=false to force immediate indexing, bypassing the queue for reliable completion.

Manual Index Operations

# Flush (remove all of a model's records from its index)
php artisan scout:flush "App\Models\User"

# Delete a specific index
php artisan scout:delete-index users_index_1758826582

# Delete all indices
php artisan scout:delete-all-indexes

# Create a new index
php artisan scout:index users_index

# Check indices via curl
curl http://localhost:9200/_cat/indices?v

# Check aliases
curl http://localhost:9200/_cat/aliases?v

Search Features

The configuration supports 5 languages with dedicated analyzers:

Language Analyzers:

  • analyzer_nl: Dutch (stop words + stemming)
  • analyzer_en: English (stop words + stemming)
  • analyzer_fr: French (stop words + stemming)
  • analyzer_de: German (stop words + stemming)
  • analyzer_es: Spanish (stop words + stemming)

Special Analyzers:

  • name_analyzer: For profile names with edge n-grams (autocomplete)
  • locations_analyzer: For cities/districts with custom stop words
  • analyzer_general: Generic tokenization for general text

Boost Configuration

Field boost factors (configured in config/timebank-cc.php):

Profile Fields:

'name' => 1,
'full_name' => 1,
'cyclos_skills' => 1.5,
'tags' => 2,              // Highest boost
'tag_categories' => 1.4,
'motivation' => 1,
'about_short' => 1,
'about' => 1,

Post Fields:

'title' => 2,                  // Highest boost
'excerpt' => 1.5,
'content' => 1,
'post_category_name' => 2,     // High boost

Model Boost (score multipliers):

'user' => 1,           // Baseline
'organization' => 3,   // 3x boost
'bank' => 3,           // 3x boost
'post' => 4,           // 4x boost (highest)

The application has advanced location boost factors:

'same_district' => 5.0,    // Highest boost
'same_city' => 3.0,        // High boost
'same_division' => 2.0,    // Medium boost
'same_country' => 1.5,     // Base boost
'different_country' => 1.0, // Neutral
'no_location' => 0.9,      // Slight penalty
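
How the application applies these factors is not shown here. One plausible reading is that the most specific shared level wins, which the sketch below illustrates. Everything in it is an assumption for illustration: the function name location_boost, the rule that a missing country counts as "no location", and the requirement that broader levels must match before narrower ones are compared.

```shell
# Pick a boost for one result from eight positional arguments:
#   $1-$4  searcher's district, city, division, country
#   $5-$8  result's   district, city, division, country
location_boost() {
  if [ -z "$4" ] || [ -z "$8" ]; then echo 0.9; return; fi        # no_location
  if [ "$4" != "$8" ]; then echo 1.0; return; fi                   # different_country
  if [ -z "$3" ] || [ "$3" != "$7" ]; then echo 1.5; return; fi    # same_country
  if [ -z "$2" ] || [ "$2" != "$6" ]; then echo 2.0; return; fi    # same_division
  if [ -z "$1" ] || [ "$1" != "$5" ]; then echo 3.0; return; fi    # same_city
  echo 5.0                                                          # same_district
}

# Usage:
#   location_boost Centrum Amsterdam NH NL  Noord Amsterdam NH NL
```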

Search Highlighting

Search results include highlighted matches:

'fragment_size' => 80,           // Characters per fragment
'number_of_fragments' => 2,      // Max fragments
'pre-tags' => '<span class="font-semibold text-white leading-tight">',
'post-tags' => '</span>',
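
For testing highlighting directly against Elasticsearch, the settings above can be expressed as a raw query body. This sketch assumes the application passes them to Elasticsearch's standard highlight parameters (fragment_size, number_of_fragments, pre_tags, post_tags); highlight_query is a hypothetical helper, and the field and search text must not contain double quotes.

```shell
# Print a search body that matches $2 in field $1 and highlights the hits
# with the fragment and tag settings listed above.
highlight_query() {
  cat <<EOF
{
  "query": { "match": { "$1": "$2" } },
  "highlight": {
    "fragment_size": 80,
    "number_of_fragments": 2,
    "pre_tags": ["<span class=\"font-semibold text-white leading-tight\">"],
    "post_tags": ["</span>"],
    "fields": { "$1": {} }
  }
}
EOF
}

# Usage:
#   curl -s -X GET 'http://localhost:9200/users_index/_search?pretty' \
#        -H 'Content-Type: application/json' -d "$(highlight_query name test)"
```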

Caching

Search results are cached for performance:

'cache_results' => 5,  // TTL in minutes

Index Structure Examples

Users Index Mapping

{
  "users_index": {
    "properties": {
      "id": { "type": "keyword" },
      "name": {
        "type": "text",
        "analyzer": "name_analyzer",
        "fields": {
          "keyword": { "type": "keyword" },
          "suggest": { "type": "completion" }
        }
      },
      "about_nl": { "type": "text", "analyzer": "analyzer_nl" },
      "about_en": { "type": "text", "analyzer": "analyzer_en" },
      "about_fr": { "type": "text", "analyzer": "analyzer_fr" },
      "about_de": { "type": "text", "analyzer": "analyzer_de" },
      "about_es": { "type": "text", "analyzer": "analyzer_es" },
      "locations": {
        "properties": {
          "district": { "type": "text", "analyzer": "locations_analyzer" },
          "city": { "type": "text", "analyzer": "locations_analyzer" },
          "division": { "type": "text", "analyzer": "locations_analyzer" },
          "country": { "type": "text", "analyzer": "locations_analyzer" }
        }
      },
      "tags": {
        "properties": {
          "contexts": {
            "properties": {
              "tags": {
                "properties": {
                  "name_nl": { "type": "text", "analyzer": "analyzer_nl" },
                  "name_en": { "type": "text", "analyzer": "analyzer_en" }
                  // ... other languages
                }
              }
            }
          }
        }
      }
    }
  }
}

Posts Index Mapping

{
  "posts_index": {
    "properties": {
      "id": { "type": "keyword" },
      "category_id": { "type": "integer" },
      "status": { "type": "keyword" },
      "featured": { "type": "boolean" },
      "post_translations": {
        "properties": {
          "title_nl": {
            "type": "text",
            "analyzer": "analyzer_nl",
            "fields": {
              "keyword": { "type": "keyword" },
              "suggest": { "type": "completion" }
            }
          },
          "title_en": {
            "type": "text",
            "analyzer": "analyzer_en",
            "fields": {
              "keyword": { "type": "keyword" },
              "suggest": { "type": "completion" }
            }
          },
          "content_nl": { "type": "text", "analyzer": "analyzer_nl" },
          "content_en": { "type": "text", "analyzer": "analyzer_en" },
          "from_nl": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||strict_date_optional_time||epoch_millis"
          }
          // ... other languages and fields
        }
      }
    }
  }
}

Troubleshooting

Elasticsearch Won't Start

Problem: Service fails to start

Solutions:

  1. Check memory settings:
# View JVM settings
cat /etc/elasticsearch/jvm.options.d/heap.options

# Check available system memory
free -h

# Ensure heap size doesn't exceed 50% of RAM
  2. Check disk space:
df -h /var/lib/elasticsearch
  3. Check logs:
sudo journalctl -u elasticsearch -n 100 --no-pager
sudo tail -f /var/log/elasticsearch/elasticsearch.log
  4. Check Java installation:
# System Java (only relevant if Elasticsearch is configured to use it)
java -version
# JDK bundled with the Elasticsearch package
/usr/share/elasticsearch/jdk/bin/java -version

Connection Refused

Problem: Cannot connect to Elasticsearch

Solutions:

  1. Verify Elasticsearch is running:
sudo systemctl status elasticsearch
  2. Check port binding:
ss -tlnp | grep 9200
  3. Check configuration:
sudo grep -E "^network.host|^http.port" /etc/elasticsearch/elasticsearch.yml
  4. Test connection:
curl http://localhost:9200

Index Not Found

Problem: index_not_found_exception when searching

Solutions:

  1. Check if indices exist:
curl http://localhost:9200/_cat/indices?v
  2. Check if aliases exist:
curl http://localhost:9200/_cat/aliases?v
  3. Reimport the model:
php artisan scout:import "App\Models\User"
  4. Or run the full reindex script:
./re-index-search.sh
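
The alias check can be scripted so that a reimport is only triggered when the alias is really missing. A minimal sketch, assuming the alias name is the first column of the `_cat/aliases` output; alias_exists is a hypothetical helper.

```shell
# Succeed if alias $1 appears in the first column of `_cat/aliases`
# output read from stdin.
alias_exists() {
  awk -v a="$1" '$1 == a { found = 1 } END { exit !found }'
}

# Usage:
#   curl -s 'http://localhost:9200/_cat/aliases?h=alias,index' \
#     | alias_exists users_index || php artisan scout:import "App\Models\User"
```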

Slow Indexing / High Memory Usage

Problem: Indexing takes too long or uses excessive memory

Solutions:

  1. Enable queue for async indexing in .env:
SCOUT_QUEUE=true
QUEUE_CONNECTION=redis
  2. Start queue worker:
php artisan queue:work --queue=high,default
  3. Reduce chunk size in config/scout.php:
'chunk' => [
    'searchable' => 250,  // Reduced from 500
],
  4. Monitor Elasticsearch memory:
curl http://localhost:9200/_nodes/stats/jvm?pretty

Search Results Are Incorrect

Problem: Search doesn't return expected results

Solutions:

  1. Check index mapping:
curl http://localhost:9200/users_index/_mapping?pretty
  2. Test query directly:
curl -X GET "localhost:9200/users_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match": {
      "name": "test"
    }
  }
}
'
  3. Clear and rebuild index:
php artisan scout:flush "App\Models\User"
php artisan scout:import "App\Models\User"
  4. Check Scout queue jobs:
php artisan queue:failed
php artisan queue:retry all

Out of Memory Errors

Problem: OutOfMemoryError in Elasticsearch logs

Solutions:

  1. Increase JVM heap (but respect limits):
# Edit /etc/elasticsearch/jvm.options.d/heap.options
-Xms4g
-Xmx4g
  2. Restart Elasticsearch:
sudo systemctl restart elasticsearch
  3. Monitor memory usage (the URL is quoted so the shell does not treat & as a background operator):
watch -n 1 "curl -s 'http://localhost:9200/_cat/nodes?v&h=heap.percent,ram.percent'"
  4. Clear fielddata cache:
curl -X POST "localhost:9200/_cache/clear?fielddata=true"
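
Heap pressure can also be read from the node stats and checked against a threshold. This sketch parses the heap_used_percent field with sed, so it assumes pretty-printed output from a single node; the helper names and the default threshold of 85 are assumptions, not values from the application.

```shell
# Extract the first heap_used_percent value from `_nodes/stats/jvm?pretty`
# output read from stdin.
heap_used_percent() {
  sed -n 's/.*"heap_used_percent" *: *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print "high", "normal", or "unknown" for the stats on stdin.
# $1 is an optional threshold in percent (assumed default: 85).
heap_pressure() {
  local pct
  pct=$(heap_used_percent)
  if [ -z "$pct" ]; then
    echo unknown
  elif [ "$pct" -ge "${1:-85}" ]; then
    echo high
  else
    echo normal
  fi
}

# Usage:
#   curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | heap_pressure 80
```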

Shards Unassigned

Problem: Yellow or red cluster health

Solutions:

  1. Check cluster health:
curl http://localhost:9200/_cluster/health?pretty
  2. Check shard allocation:
curl http://localhost:9200/_cat/shards?v
  3. For single-node setup, set replicas to 0:
curl -X PUT "localhost:9200/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 0
  }
}
'
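
The health colors map to specific shard states, which explains why a single-node setup with replicas stays yellow. The sketch below extracts the status from the health response and prints what it means; cluster_status and explain_status are hypothetical helper names, and the sed-based parsing assumes the pretty-printed response format.

```shell
# Extract the "status" value from `_cluster/health?pretty` output on stdin.
cluster_status() {
  sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p' | head -n 1
}

# Describe what a cluster health status ($1) means for shard allocation.
explain_status() {
  case "$1" in
    green) echo "all primary and replica shards are assigned" ;;
    yellow) echo "all primaries assigned, some replicas unassigned (expected on a single node with replicas > 0)" ;;
    red) echo "at least one primary shard is unassigned" ;;
    *) echo "unrecognized status: $1" ;;
  esac
}

# Usage:
#   explain_status "$(curl -s 'http://localhost:9200/_cluster/health?pretty' | cluster_status)"
```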

Production Recommendations

Security

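  1. Enable X-Pack Security:

Edit /etc/elasticsearch/elasticsearch.yml:

xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
  2. Restart Elasticsearch so the setting takes effect (the password tool needs a running, security-enabled node):
sudo systemctl restart elasticsearch
  3. Set passwords:
sudo /usr/share/elasticsearch/bin/elasticsearch-setup-passwords auto
  4. Update .env:
ELASTICSEARCH_USER=elastic
ELASTICSEARCH_PASSWORD=generated_password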

Performance Optimization

  1. Increase file descriptors:
# /etc/security/limits.conf
elasticsearch soft nofile 65535
elasticsearch hard nofile 65535
  2. Disable swapping:
# /etc/elasticsearch/elasticsearch.yml
bootstrap.memory_lock: true

Edit /etc/systemd/system/elasticsearch.service.d/override.conf:

[Service]
LimitMEMLOCK=infinity

Then reload systemd and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart elasticsearch
  3. Use SSD for data directory:
# /etc/elasticsearch/elasticsearch.yml
path.data: /mnt/ssd/elasticsearch
  4. Set appropriate refresh interval:
curl -X PUT "localhost:9200/users_index/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "refresh_interval": "30s"
  }
}
'

Backup and Restore

  1. Configure snapshot repository (the location must first be listed under path.repo in /etc/elasticsearch/elasticsearch.yml, followed by a restart):
curl -X PUT "localhost:9200/_snapshot/backup_repo" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/var/backups/elasticsearch",
    "compress": true
  }
}
'
  2. Create snapshot:
curl -X PUT "localhost:9200/_snapshot/backup_repo/snapshot_1?wait_for_completion=true"
  3. Restore snapshot (an existing open index with the same name must be closed or deleted first):
curl -X POST "localhost:9200/_snapshot/backup_repo/snapshot_1/_restore"
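
A fixed name such as snapshot_1 can only be used once per repository, so scheduled backups need a unique name each time. The sketch below generates a timestamped name and the matching URL; snapshot_name and snapshot_url are hypothetical helpers, and the naming pattern is an assumption.

```shell
# Print a unique, lowercase snapshot name based on the current UTC time,
# for example snapshot_20260323_203759.
snapshot_name() {
  date -u +snapshot_%Y%m%d_%H%M%S
}

# Print the snapshot URL for repository $1 and snapshot name $2.
snapshot_url() {
  echo "localhost:9200/_snapshot/$1/$2?wait_for_completion=true"
}

# Usage:
#   curl -X PUT "$(snapshot_url backup_repo "$(snapshot_name)")"
```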

Monitoring

  1. Check cluster stats:
curl http://localhost:9200/_cluster/stats?pretty
  2. Monitor node stats:
curl http://localhost:9200/_nodes/stats?pretty
  3. Check index stats:
curl http://localhost:9200/_stats?pretty
  4. Set up monitoring with Kibana (optional):
sudo apt-get install kibana=7.17.24
sudo systemctl enable kibana
sudo systemctl start kibana

Quick Reference

Essential Commands

# Service management
sudo systemctl start elasticsearch
sudo systemctl stop elasticsearch
sudo systemctl restart elasticsearch
sudo systemctl status elasticsearch

# Check health
curl http://localhost:9200
curl http://localhost:9200/_cluster/health?pretty
curl http://localhost:9200/_cat/indices?v

# Laravel Scout commands
php artisan scout:import "App\Models\User"
php artisan scout:flush "App\Models\User"
php artisan scout:delete-all-indexes

# Reindex everything
./re-index-search.sh

# Queue worker for async indexing
php artisan queue:work --queue=high,default

Configuration Files

  • .env - Connection and driver configuration
  • config/scout.php - Laravel Scout settings
  • config/elasticsearch.php - Index mappings and analyzers (825 lines!)
  • config/timebank-cc.php - Search boost factors and behavior
  • /etc/elasticsearch/elasticsearch.yml - Elasticsearch server config
  • /etc/elasticsearch/jvm.options.d/heap.options - JVM memory settings
  • /usr/lib/systemd/system/elasticsearch.service - systemd service

Important Paths

  • Data: /var/lib/elasticsearch
  • Logs: /var/log/elasticsearch
  • Config: /etc/elasticsearch
  • Binary: /usr/share/elasticsearch

Notes

  • This application uses a multilingual search setup with custom analyzers
  • The config/elasticsearch.php file is extensive (825 lines) with detailed field mappings
  • Location-based search uses edge n-grams for autocomplete functionality
  • Tags and categories have hierarchical support with multilingual translations
  • The reindexing script handles index versioning and aliasing automatically
  • Memory requirements are significant during indexing (plan accordingly)