OCL API 2

Build: #179 failed Changes by Michaël Bontyes

Stages & jobs

  1. Build

  2. Test

  3. Deploy for testing

    Requires a user to start manually
  4. Release

    Requires a user to start manually

Code commits

OCL API 2

  • Michaël Bontyes

    Michaël Bontyes a55e3ce3b031f90c5fe14b07d43e3f6b6cff6ba9

    Merge branch 'dev' of https://github.com/OpenConceptLab/oclapi2 into dev

  • Michaël Bontyes

    Michaël Bontyes 135e4d3b007e650c2ca60fd5dfae29a1a181fdb3

    ES score normalization
    ## PR: Normalize and Standardize Elasticsearch Search Scores

    ### Problem / What Was Missing

    - **Inconsistent Scoring:**
      Elasticsearch’s raw `_score` is not normalized and varies widely across queries and indices, even for near-identical queries. This caused:
      - Difficulty comparing scores across different queries or result sets.
      - Unstable thresholds for “high confidence” or “best match.”
      - Confusing or misleading confidence displays for users.

    - **Downstream Usage Issues:**
      - Confidence buckets and thresholds were based on the raw `_score`, which is not absolute.
      - The API and UI only exposed the raw score, not a normalized or percentage-based value.

    ---

    ### What Was Implemented

    #### 1. **Score Normalization**
    - **Min-Max Normalization:**
      For each search result set, the code now computes the minimum and maximum `_score` values. Each result’s score is normalized to a 0–1 range:
      ```
      normalized_score = (raw_score - min_score) / (max_score - min_score)
      ```
      - Handles edge cases where all scores are the same.
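
    The normalization above can be sketched as follows. This is an illustrative sketch, not the PR's actual code; the function name and the choice to map identical scores to 1.0 are assumptions.

    ```python
    def normalize_scores(raw_scores):
        """Min-max normalize raw Elasticsearch `_score` values to the 0-1 range.

        Illustrative sketch; the edge-case behavior (all-equal scores -> 1.0)
        is an assumption, not necessarily what the PR implements.
        """
        if not raw_scores:
            return []
        min_score = min(raw_scores)
        max_score = max(raw_scores)
        if max_score == min_score:
            # Every result scored identically: treat each hit as a full match.
            return [1.0 for _ in raw_scores]
        return [(s - min_score) / (max_score - min_score) for s in raw_scores]
    ```

    For example, raw scores `[2.0, 4.0, 6.0]` normalize to `[0.0, 0.5, 1.0]`.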

    #### 2. **API/Serializer Enhancements**
    - **Expose Both Scores:**
      The API now returns both the raw `_score` and the normalized score (`search_score` and `search_normalized_score`).
    - **Confidence Calculation:**
      The `search_confidence` field is now based on the normalized score, providing a consistent percentage (e.g., “87.5%”) regardless of the raw score range.
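
    A minimal sketch of how the serializer output could look. The field names `search_score`, `search_normalized_score`, and `search_confidence` come from the PR description; the helper functions and the exact response shape are assumptions.

    ```python
    def search_confidence(normalized_score):
        """Format a 0-1 normalized score as a percentage string (assumed format)."""
        return f"{normalized_score * 100:.1f}%"

    def serialize_hit(raw_score, normalized_score):
        # Expose both the raw ES `_score` and the normalized value,
        # as described in the PR; the dict shape here is hypothetical.
        return {
            "search_score": raw_score,
            "search_normalized_score": normalized_score,
            "search_confidence": search_confidence(normalized_score),
        }
    ```

    With this shape, a hit with a normalized score of 0.875 would report a confidence of "87.5%" regardless of the raw score range.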

    #### 3. **Downstream Logic Updates**
    - **Thresholds and Buckets:**
      All logic for “high confidence,” “very high match,” and bucketing now uses the normalized score, so thresholds are stable (e.g., 0.8 always means 80% of the way between the lowest and highest score in the result set).
    - **Legacy Fallback:**
      If a normalized score is not available, the code falls back to the old raw score logic.
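
    The bucketing-with-fallback logic could be sketched like this. The bucket names, boundaries, and fallback strategy are illustrative assumptions, not the PR's actual thresholds.

    ```python
    def confidence_bucket(normalized_score=None, raw_score=None, max_raw_score=None):
        """Bucket a result by normalized score, with a legacy raw-score fallback.

        Bucket boundaries and the fallback (raw / max raw) are assumptions
        for illustration only.
        """
        if normalized_score is None and raw_score is not None and max_raw_score:
            # Legacy fallback: approximate a 0-1 value against the max raw score.
            normalized_score = raw_score / max_raw_score
        if normalized_score is None:
            return "unknown"
        if normalized_score >= 0.8:
            return "very high"
        if normalized_score >= 0.5:
            return "high"
        return "low"
    ```

    Because the buckets key off the 0-1 range, the same boundaries hold for every query and index.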

    #### 4. **Documentation in Code**
    - **Comments and Structure:**
      The code is now clear about which score is being used and why, making it easier for future maintainers to understand the normalization process.

    ---

    ### Why This Makes Scores More Consistent

    - **Stable Range:**
      All scores are now in a 0–1 range, so thresholds and confidence levels are meaningful and comparable across queries.
    - **User-Friendly Confidence:**
      Users and downstream consumers can interpret confidence as a percentage, not an arbitrary number.
    - **Easier Tuning:**
      Product and engineering teams can set thresholds (e.g., “show only results with confidence > 70%”) without worrying about the quirks of Elasticsearch’s raw scoring.
    - **Future-Proof:**
      If the underlying Elasticsearch configuration changes, the normalization ensures the API and UI remain stable.
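
    The “confidence > 70%” tuning described above reduces to a simple filter once every result carries a normalized score. A hypothetical sketch, assuming results are dicts carrying the PR's `search_normalized_score` field:

    ```python
    def filter_by_confidence(results, threshold=0.7):
        """Keep only results whose normalized score meets the threshold.

        `results` is assumed to be a list of dicts with the
        `search_normalized_score` field; this helper is hypothetical.
        """
        return [
            r for r in results
            if r.get("search_normalized_score", 0.0) >= threshold
        ]
    ```

    Raising or lowering `threshold` tunes result strictness without any knowledge of Elasticsearch's raw scoring quirks.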

    ---

    ### Summary Table

    | Field                     | Before (raw) | After (normalized) |
    |---------------------------|--------------|--------------------|
    | `search_score`            | Raw float    | Raw float          |
    | `search_normalized_score` | N/A          | 0–1 float          |
    | `search_confidence`       | % of max raw | % of normalized    |
    | Thresholds/Buckets        | Raw-based    | Normalized-based   |

    ---

    **In summary:**
    This PR makes search scoring more robust, interpretable, and consistent for all users and downstream systems.