Skip to content

Add count-deces target, standardize GIT_ROOT and improve README #155

Open
sebfournier95 wants to merge 3 commits intomatchID-project:devfrom
sebfournier95:features/feature-001
Open

Add count-deces target, standardize GIT_ROOT and improve README #155
sebfournier95 wants to merge 3 commits intomatchID-project:devfrom
sebfournier95:features/feature-001

Conversation

@sebfournier95
Copy link

@sebfournier95 sebfournier95 commented Oct 27, 2025

🔧 Makefile and Documentation Improvements

Changes

1. 🔄 Git Variable Standardization (line 9)

Change: GITROOTGIT_ROOT

Motivation:

  • Harmonize naming convention with backend and tools modules
  • Improves code consistency and readability
  • Facilitates inter-component communication when passing variables

Impact:

  • Variable exported for use in sub-makefiles
  • Used notably in the ${GIT_BACKEND} target for repository cloning

2. 📚 README Documentation Enhancement

Changes:

  • More complete and detailed repository description
  • Enhanced documentation on Elasticsearch index preparation process
  • Better explanation of INSEE death data processing

Benefits:

  • Facilitates onboarding of new contributors
  • Clarifies project purpose and operation
  • Improves understanding of data preparation workflow

Type of Change

  • Refactoring (improvement without functional changes)
  • Documentation update
  • Bug fix
  • Breaking change

Tests

  • Git cloning still works correctly with GIT_ROOT
  • README documentation has been reviewed and validated

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **Documentation**
  * Comprehensive documentation expansion covering setup, configuration, pipeline architecture, and operational workflows
  * Added detailed guides for deployment, troubleshooting, performance tuning, and data/schema descriptions
  * Included configuration examples for various storage and deployment scenarios and best practices

* **Chores**
  * Minor build/provisioning tweak: renamed a build variable for clarity and added a harmless artifact write step during provisioning
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

@coderabbitai
Copy link

coderabbitai bot commented Oct 27, 2025

Walkthrough

Renames Makefile exported variable GITROOTGIT_ROOT and appends a blank line to backend/artifacts after copying; README.md is fully rewritten into an extensive operational and technical documentation manual for the dataprep project.

Changes

Cohort / File(s) Summary
Makefile: variable & artifacts
Makefile
Renames exported variable GITROOTGIT_ROOT; updates clone/copy steps to reference GIT_ROOT; after copying artifacts, appends a blank line to backend/artifacts. No other logic or error-handling changes.
Documentation Overhaul
README.md
Complete replacement of README content with a detailed project manual covering overview, pipeline architecture, components, data processing steps, Elasticsearch schemas/analyzers, prerequisites, installation, workflows (dev/full/incremental/restore), configuration examples (env vars, Elasticsearch, S3/Scaleway/local), Makefile target descriptions, data formats, health checks, diagnostics, performance/security guidance, and references.

Sequence Diagram(s)

Not applicable — changes do not modify application control flow.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Check for any remaining references to GITROOT across CI, scripts, contributors' docs, or templates.
  • Verify that appending a blank line to backend/artifacts does not break downstream parsers or consumers.
  • Skim README for consistency with actual Makefile targets, environment variable names, and configuration keys.

Poem

🐇 I hopped through lines both small and grand,
Renamed a root, left a gentle band,
A blank line tucked where artifacts meet,
Pages grew into a guide complete,
Nibbles of docs and tidy treats! 🥕📜

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title "Add count-deces target, standardize GIT_ROOT and improve README" directly aligns with all three main changes described in the PR objectives: the addition of a new Makefile target for counting death records, the renaming of the GITROOT variable to GIT_ROOT for standardization, and the comprehensive README documentation improvements. The title is specific, clear, and concise (9 words, 64 characters), avoiding vague or generic phrasing. Each item mentioned in the title corresponds to actual changes present in the changeset as documented in the raw summary and PR objectives.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Makefile (1)

9-9: Variable naming incomplete: GITROOT should be GIT_ROOT per PR objectives.

The PR description states the Git variable should be renamed from GITROOT to GIT_ROOT for harmonization, but line 9 still exports GITROOT, and line 138 still references ${GITROOT}. This creates an inconsistency between the stated objective and implementation, and will break any downstream sub-makefiles or external code expecting the GIT_ROOT variable name mentioned in the PR summary.

To complete the rename as described, update:

- export GITROOT = https://github.com/matchid-project
+ export GIT_ROOT = https://github.com/matchid-project

And update line 138:

- @${GIT} clone -q ${GITROOT}/${GIT_BACKEND}
+ @${GIT} clone -q ${GIT_ROOT}/${GIT_BACKEND}

Also applies to: 138-138

🧹 Nitpick comments (1)
README.md (1)

31-33: Fenced code block missing language identifier.

The diagram code block starting at line 31 does not specify a language. While this renders in most Markdown viewers, it prevents syntax highlighting and is flagged as a linting issue (MD040).

Apply a language identifier (or use plaintext) to improve clarity:

- ```
+ ```plaintext
  Data.gouv.fr → Stockage S3 → Traitement matchID → Index Elasticsearch → Sauvegarde
- ```
+ ```

Comment on lines +9 to +21
## 📋 Table des matières

- [Vue d'ensemble](#-vue-densemble)
- [Prérequis](#-prérequis)
- [Installation](#-installation)
- [Démarrage rapide](#-démarrage-rapide)
- [Workflows](#-workflows)
- [Commandes disponibles](#-commandes-disponibles)
- [Configuration](#-configuration)
- [Traitements des données](#-traitements-des-données)
- [Déploiement distant](#-déploiement-distant)
- [Dépannage](#-dépannage)
- [License](#-license)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Invalid anchor links in table of contents.

The TOC links on lines 11, 17, and 19 reference anchors that don't match the actual section headings due to character encoding or formatting differences. For example:

  • Line 11: [Vue d'ensemble](#-vue-densemble) – the apostrophe character will not match correctly in the anchor
  • Line 17: [Déploiement distant](#-déploiement-distant) – accent normalization may fail
  • Line 19: [Dépannage](#-dépannage) – similar accent/character handling issue

This breaks TOC navigation in some Markdown renderers.

Verify and correct the anchor links by:

  1. Testing TOC navigation in your intended rendering environment (GitHub, GitLab, etc.)
  2. Using proper URL-encoded or normalized anchor IDs that match the heading markdown

Example fix for line 11:

- - [Vue d'ensemble](#-vue-densemble)
+ - [Vue d'ensemble](#-vue-densemble)

(Ensure the anchor character is consistent with how the renderer normalizes the heading text.)


🏁 Script executed:

#!/bin/bash
# Locate README.md and extract TOC + corresponding section headings
find . -name "README.md" -type f | head -5

Length of output: 11


🏁 Script executed:

#!/bin/bash
# Extract TOC section and section headings to verify anchor links
if [ -f "README.md" ]; then
  echo "=== TOC SECTION (lines 9-21) ==="
  sed -n '9,21p' README.md
  echo ""
  echo "=== SECTION HEADINGS (first 50 lines) ==="
  head -50 README.md | grep -n "^#"
  echo ""
  echo "=== SPECIFIC LINES: 31, 346, 348 ==="
  echo "Line 31:"
  sed -n '31p' README.md
  echo ""
  echo "Lines 346-348:"
  sed -n '346,348p' README.md
fi

Length of output: 1078


🏁 Script executed:

#!/bin/bash
# Extract ALL section headings to verify TOC anchor validity
if [ -f "README.md" ]; then
  echo "=== ALL SECTION HEADINGS ==="
  grep -n "^#" README.md | head -30
  echo ""
  echo "=== CONTEXT AROUND LINE 31 (lines 26-35) ==="
  sed -n '26,35p' README.md
fi

Length of output: 1333


🏁 Script executed:

#!/bin/bash
# Test Markdown anchor generation: how GitHub normalizes headings with French accents and apostrophes
# Modern renderers strip accents and replace apostrophes; verify what the actual anchors should be

cat > /tmp/test_anchors.py << 'EOF'
import re

def normalize_anchor(heading_text):
    """
    Simulate how GitHub Markdown normalizes headings to anchors.
    - Remove emojis
    - Lowercase
    - Replace spaces and apostrophes with hyphens
    - Remove/collapse special characters
    """
    # Remove emojis
    text = re.sub(r'[\U0001F300-\U0001F9FF]|\s+', ' ', heading_text).strip()
    # Lowercase
    text = text.lower()
    # Replace apostrophes and spaces with hyphens
    text = re.sub(r"['\s]+", '-', text)
    # Remove any remaining non-alphanumeric except hyphens
    text = re.sub(r'[^a-z0-9-]', '', text)
    # Collapse multiple hyphens
    text = re.sub(r'-+', '-', text)
    # Strip leading/trailing hyphens
    text = text.strip('-')
    return text

headings = [
    "Vue d'ensemble",
    "Prérequis",
    "Installation",
    "Démarrage rapide",
    "Workflows",
    "Commandes disponibles",
    "Configuration",
    "Traitements des données",
    "Déploiement distant",
    "Dépannage",
    "License"
]

print("Expected anchor normalizations:")
for h in headings:
    anchor = normalize_anchor(h)
    print(f"  {h:30} -> #{anchor}")
EOF

python /tmp/test_anchors.py

Length of output: 596


🏁 Script executed:

#!/bin/bash
# Extract table header and verify column count on lines 346-348
if [ -f "README.md" ]; then
  echo "=== TABLE CONTEXT (lines 340-350) ==="
  sed -n '340,350p' README.md
fi

Length of output: 586


Fix malformed TOC anchor links by removing leading hyphens.

All 10 TOC links (lines 11–20) have erroneous leading hyphens in their anchors (e.g., #-vue-densemble instead of #vue-densemble), causing GitHub and other Markdown renderers to fail navigation. The root cause is the anchor format, not apostrophes or accents.

Correct the anchors by removing the leading hyphen from each link:

- - [Vue d'ensemble](#-vue-densemble)
+ - [Vue d'ensemble](#vue-densemble)
- - [Prérequis](#-prérequis)
+ - [Prérequis](#prérequis)
- - [Installation](#-installation)
+ - [Installation](#installation)
- - [Démarrage rapide](#-démarrage-rapide)
+ - [Démarrage rapide](#démarrage-rapide)
- - [Workflows](#-workflows)
+ - [Workflows](#workflows)
- - [Commandes disponibles](#-commandes-disponibles)
+ - [Commandes disponibles](#commandes-disponibles)
- - [Configuration](#-configuration)
+ - [Configuration](#configuration)
- - [Traitements des données](#-traitements-des-données)
+ - [Traitements des données](#traitements-des-données)
- - [Déploiement distant](#-déploiement-distant)
+ - [Déploiement distant](#déploiement-distant)
- - [Dépannage](#-dépannage)
+ - [Dépannage](#dépannage)
- - [License](#-license)
+ - [License](#license)

Minor: Add language specifier to code block at line 32 (currently bare ```).

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
## 📋 Table des matières
- [Vue d'ensemble](#-vue-densemble)
- [Prérequis](#-prérequis)
- [Installation](#-installation)
- [Démarrage rapide](#-démarrage-rapide)
- [Workflows](#-workflows)
- [Commandes disponibles](#-commandes-disponibles)
- [Configuration](#-configuration)
- [Traitements des données](#-traitements-des-données)
- [Déploiement distant](#-déploiement-distant)
- [Dépannage](#-dépannage)
- [License](#-license)
## 📋 Table des matières
- [Vue d'ensemble](#vue-densemble)
- [Prérequis](#prérequis)
- [Installation](#installation)
- [Démarrage rapide](#démarrage-rapide)
- [Workflows](#workflows)
- [Commandes disponibles](#commandes-disponibles)
- [Configuration](#configuration)
- [Traitements des données](#traitements-des-données)
- [Déploiement distant](#déploiement-distant)
- [Dépannage](#dépannage)
- [License](#license)
🧰 Tools
🪛 LanguageTool

[grammar] ~9-~9: Il y a peut-être une erreur ici
Context: ...ecedees/) --- ## 📋 Table des matières - Vue d'ensemble - [Prér...

(QB_NEW_FR)


[grammar] ~21-~21: Il y a peut-être une erreur ici
Context: ...t) - Dépannage - License --- ## 🏗️ Vue d'ensemble Le projet [`deces-da...

(QB_NEW_FR)

🪛 markdownlint-cli2 (0.18.1)

11-11: Link fragments should be valid

(MD051, link-fragments)


17-17: Link fragments should be valid

(MD051, link-fragments)


19-19: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents
In README.md around lines 9 to 21, the table of contents entries use malformed
anchors with a leading hyphen (e.g., `#-vue-densemble`) so links fail; remove
the leading hyphen from each anchor in lines 11–20 so they read e.g.,
`#vue-densemble`, `#prérequis`, etc., and also update the code block at line 32
to include an appropriate language specifier (for example, use ```bash or ```sh)
to enable syntax highlighting.


| Variable | Valeur | Description |
|----------|--------|-------------|
| [`FILES_TO_SYNC`](Makefile:39) | `fichier-opposition-deces-.*.csv(.gz)?|deces-.*.txt(.gz)?` | Fichiers à synchroniser |
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Table column count mismatch.

Lines 346 and 348 have tables with more cells than expected header columns:

  • Line 346: FILES_TO_SYNC row appears to have 4 values but table expects 3 columns
  • Line 348: FILES_TO_PROCESS row similarly has extra data

This causes rendering issues and data loss in strict Markdown parsers (MD056).

Verify and fix the table structure by ensuring consistent column alignment. For example, if the table should have 3 columns (Variable | Valeur | Description), split or reorganize multi-part values:

| Variable | Valeur | Description |
|----------|--------|-------------| 
- | [`FILES_TO_SYNC`](Makefile:39) | `fichier-opposition-deces-.*.csv(.gz)?\|deces-.*.txt(.gz)?` | Fichiers à synchroniser |
+ | [`FILES_TO_SYNC`](Makefile:39) | `fichier-opposition-deces-.*.csv(.gz)?` / `deces-.*.txt(.gz)?` | Fichiers à synchroniser |

Also applies to: 348-348

🧰 Tools
🪛 LanguageTool

[typographical] ~346-~346: Caractère d’apostrophe incorrect.
Context: ...| |----------|--------|-------------| | FILES_TO_SYNC | `fichier-...

(APOS_INCORRECT)


[grammar] ~346-~346: Il y a peut-être une erreur ici
Context: ...---| | FILES_TO_SYNC | fichier-opposition-deces-.*.csv(.gz)?|deces-.*.txt(.gz)? | Fichiers à synchroniser | | [`FILES...

(QB_NEW_FR)

🪛 markdownlint-cli2 (0.18.1)

346-346: Table column count
Expected: 3; Actual: 4; Too many cells, extra data will be missing

(MD056, table-column-count)

🤖 Prompt for AI Agents
In README.md around lines 346 and 348, the table rows for FILES_TO_SYNC and
FILES_TO_PROCESS contain an extra cell causing a 4-column row in a 3-column
table; edit each row so it has exactly three pipe-separated cells (Variable |
Valeur | Description) by moving any multi-part patterns into a single cell
(e.g., combine the regex patterns into one Valeur cell wrapped as inline code or
escape internal pipes), or split the description into the Description cell,
ensuring no stray pipes create extra columns; after editing, verify both rows
match the header column count and render correctly.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a52959 and c8c0c74.

📒 Files selected for processing (1)
  • Makefile (2 hunks)
🔇 Additional comments (2)
Makefile (2)

141-141: Clarify the purpose of appending a blank line to artifacts.

Line 141 appends an empty line to ${GIT_BACKEND}/artifacts. The motivation and functional purpose of this side effect is unclear from the code context alone. Please confirm this is intentional and explain why the blank line is necessary for the provisioning flow.


85-98: Discrepancy between PR objectives and provided code.

The PR objectives mention a new count-deces target at lines 85–98 that iterates over deces-*.txt.gz files and uses zcat for decompression. However, the provided code snippet shows the ${DATA_TAG} target and other unrelated targets at those line ranges. Please verify whether the count-deces target was actually added or if the line range references in the PR objectives are incorrect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant