This project analyzes user journeys across pages of the 365 Data Science website. It includes preprocessing of raw journey data, generation of metrics (page counts, destinations, sequences), and visualizations.

```text
PYTHON_CAPSTONE/
│
├── pictures/                          # Generated visualizations
│   ├── metrics.png
│   ├── pg_count_frequency_dist.png
│   └── pg_presence_freq_dist.png
│
├── processed_dfs/                     # Intermediate processed datasets
│   ├── clean_collated_userjourney.csv
│   ├── collated_userjourney_by_userid.csv
│   └── user_journey_duplicates_removed.csv
│
├── results_dfs/                       # Final result datasets
│   ├── page_count.csv
│   ├── page_destination.csv
│   ├── page_presence.csv
│   └── page_sequence.csv
│
├── User_journey_raw.csv               # Raw input dataset
├── User_journey_analysis.ipynb        # Main analysis notebook
├── pyproject.toml                     # Project dependencies / environment
└── ReadME.md                          # Project documentation (this file)
```
- Raw Data (`User_journey_raw.csv`): contains the original user journey logs.
- Processing (`processed_dfs/`):
  - `clean_collated_userjourney.csv`: cleaned and merged dataset.
  - `collated_userjourney_by_userid.csv`: user journeys grouped by user.
  - `user_journey_duplicates_removed.csv`: journeys with duplicates removed.
- Results (`results_dfs/`):
  - `page_count.csv`: frequency of each page visited.
  - `page_destination.csv`: most common follow-up pages.
  - `page_presence.csv`: presence/absence distribution of pages.
  - `page_sequence.csv`: most frequent multi-page sequences.
- Visualizations (`pictures/`): page frequency and presence distributions, plus other derived metrics.
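The processing steps above can be sketched roughly as follows. This is a minimal illustration, assuming each raw row holds a user ID plus a hyphen-separated journey string, and that "duplicates removed" means dropping immediately repeated pages; the actual notebook logic may differ.

```python
def collate_journeys(rows):
    """Group journey strings by user_id, preserving order of appearance.

    `rows` is an iterable of (user_id, journey) pairs, where a journey is
    a hyphen-separated page string -- an assumed format for illustration.
    """
    collated = {}
    for user_id, journey in rows:
        collated.setdefault(user_id, []).append(journey)
    return {uid: "-".join(parts) for uid, parts in collated.items()}


def remove_consecutive_duplicates(journey):
    """Drop immediately repeated pages,
    e.g. 'Pricing-Pricing-Checkout' -> 'Pricing-Checkout'."""
    pages = journey.split("-")
    cleaned = [pages[0]]
    for page in pages[1:]:
        if page != cleaned[-1]:
            cleaned.append(page)
    return "-".join(cleaned)
```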
- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

  (or use `pyproject.toml` with Poetry or another environment manager)
- Run the Jupyter notebook:

  ```shell
  jupyter notebook User_journey_analysis.ipynb
  ```
- Outputs:
  - Processed datasets in `processed_dfs/`
  - Results in `results_dfs/`
  - Visualizations in `pictures/`
- Page Count: Total visits per page.
- Page Presence: Whether a page appears in a user’s journey.
- Page Destination: Most frequent next-page transitions.
- Page Sequence: Most common multi-step sequences (e.g., 3-page runs).
Page count: the most fundamental metric; it counts how many times each page appears across all user journeys.
As the plot shows, Log in, Homepage, and Checkout are the top 3 most-visited pages by total count, while Blog, About us, and Instructors are the least visited.
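A minimal sketch of how this metric can be computed, assuming journeys are hyphen-separated page strings (an illustrative format, not necessarily the notebook's exact representation):

```python
from collections import Counter


def page_count(journeys):
    """Count every page occurrence across all journeys."""
    counts = Counter()
    for journey in journeys:
        counts.update(journey.split("-"))
    return counts
```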

Page presence: similar to page count, but each page is counted at most once per journey; it shows how many journeys each page appears in.
By unique counts, there is a clearly stratified order to the page frequencies:
- General and Log in/Sign up pages take the highest frequencies.
- Pricing & coupons pages follow.
- Courses & certificates pages form the third stratum.
- Information & communication pages are the least frequent of all.
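The only difference from page count is deduplicating pages within each journey before counting, which a set handles; again a sketch assuming hyphen-separated journey strings:

```python
from collections import Counter


def page_presence(journeys):
    """Count each page at most once per journey."""
    counts = Counter()
    for journey in journeys:
        counts.update(set(journey.split("-")))
    return counts
```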

Page destination: a metric that shows the most frequent follow-up after every page. It looks at each page and counts which pages come next. If one is interested in what users do after visiting page X, this is the metric to consult.
For this metric, I focused on two pages: Courses and Pricing.
Courses
The most frequent follow-up to the Courses page is the Career Tracks page, showing that students are interested not only in taking individual courses but also in career-track courses.

Pricing
The most frequent follow-up to the Pricing page is the Checkout page, suggesting a high conversion of users into paying customers.
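This metric amounts to counting page-to-page transitions. A hedged sketch, again assuming hyphen-separated journey strings:

```python
from collections import Counter, defaultdict


def page_destination(journeys):
    """For every page, count which page immediately follows it."""
    followers = defaultdict(Counter)
    for journey in journeys:
        pages = journey.split("-")
        # Pair each page with its successor.
        for current, nxt in zip(pages, pages[1:]):
            followers[current][nxt] += 1
    return followers
```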

Page sequences: look at the most popular run of N pages. I consult this metric when interested in which sequence of three (or any other number of) pages shows up most often. Each distinct sequence is counted only once per journey.
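In effect this is N-gram counting over pages, deduplicated within each journey. A sketch under the same assumed journey format:

```python
from collections import Counter


def page_sequence(journeys, n=3):
    """Count runs of n consecutive pages, with each distinct run
    counted at most once per journey."""
    counts = Counter()
    for journey in journeys:
        pages = journey.split("-")
        # Collect the distinct n-page runs in this journey.
        runs = {tuple(pages[i:i + n]) for i in range(len(pages) - n + 1)}
        counts.update(runs)
    return counts
```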

Raw Data: 15.3 pages
- The average number of pages an individual user visited in a session.
Collated Pages Data: 119.3 pages
- The average journey length across all collated sessions of users.
Cleaned Collated Pages Data: 91.4 pages
- The average journey length across all collated sessions of users, after removing redundant pages such as Log in and Others.
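The average journey length reported above can be computed as total pages divided by number of journeys; a minimal sketch, assuming hyphen-separated journey strings:

```python
def average_journey_length(journeys):
    """Average number of pages per journey."""
    if not journeys:
        return 0.0
    return sum(len(j.split("-")) for j in journeys) / len(journeys)
```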
- Add support for longer sequence analysis (N > 3).
- Build a dashboard for interactive exploration.
- Integrate cohort analysis by subscription type and individual analysis of users via user_id.
(N.B. This is a course project; the data does not necessarily represent a real-world scenario.)
- Arowosegbe Victor Iyanuoluwa