Greetings!
My name is Mykola Melnyk, and I'm an ML expert with two decades of experience in software development. I specialize in turning complex business ideas into scalable, secure, and efficient AI-driven products. My expertise spans the areas below, which lets me deliver AI solutions that drive business growth and improve efficiency.
Natural Language Processing (NLP), Computer Vision (CV), and Optical Character Recognition (OCR): 5+ years of experience in document processing, understanding, and anonymization. Led the development of Spark OCR (Visual NLP) using technologies such as Python/Scala, PySpark, PyTorch, LLMs, Llama 3, Mini Gemini, LangChain, and Hugging Face Transformers.
Big Data Processing with Apache Spark: 7+ years of experience designing and optimizing large-scale data pipelines for high-performance processing. In-depth knowledge of Spark internals and Spark Structured Streaming. Creator of and contributor to the open-source spark-pdf data source project, written in Scala, which extends Spark's capabilities.
Data De-identification & Anonymization: Expert in anonymizing sensitive data from text, images, PDFs, and DICOM files. I ensure privacy, security, and compliance with GDPR and HIPAA standards using NLP, OCR, and computer vision to remove or mask personal information, safeguarding data confidentiality.
Healthcare, Pharma, MedTech, BioTech Expertise: Over 5 years of experience in the healthcare and life sciences sectors, with a strong understanding of formats like DICOM and a track record of delivering solutions tailored to the needs of these industries.
- End-to-End Expertise
- Complex Problem-Solving Ability
- Timely Delivery
- Transparent Communication
- Scalable Solutions
Programming Languages: Python, Scala
Data Science & Machine Learning: NLP, Computer Vision, Large Language Models (LLMs), Optical Character Recognition (OCR), Model Productionalization, Deep Learning (PyTorch, TensorFlow, Hugging Face Transformers, ONNX, Pandas, CLIP)
LLMs and Related Tools: OpenAI GPT, Gemini, Llama 3, FLUX, Together.ai, Ollama, Hugging Face, LangChain, LlamaIndex, LangServe, LangGraph, QLoRA, Streamlit, Gradio
Big Data & Distributed Systems: Big Data Processing, ETL, Stream Processing, Real-Time Aggregation, Apache Spark (PySpark, Spark ML, Spark Structured Streaming), Kinesis, Kafka, Databricks
Cloud Computing & Infrastructure: Amazon Web Services (AWS), Distributed Systems, CI/CD Pipelines, Docker, Jenkins, Graphite, Grafana, Elasticsearch, Kibana
Databases: PostgreSQL, MongoDB, Redis, DynamoDB
CRMs: HubSpot, Zoho CRM
Committed to long-term collaborations. Available full-time for your next project.
Source Code: https://github.com/StabRise/spark-pdf
Home page: https://stabrise.com/spark-pdf/
Quick Start Jupyter Notebook: PdfDataSource.ipynb
The project provides a custom data source for Apache Spark that lets you read PDF files into a Spark DataFrame.
- Read PDF documents into a Spark DataFrame
- Read PDF files lazily, page by page
- Support large files, up to 10k pages
- Support scanned PDF files (via OCR)
- No need to install Tesseract OCR; it is bundled with the package
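As a rough illustration of the lazy, per-page reading listed above, a PySpark read might look like the sketch below. It assumes the spark-pdf package is already on the Spark classpath and that the data source is registered under the short name `pdf`. The option names (`imageType`, `resolution`, `pagePerPartition`, `reader`) and their values are assumptions and should be checked against the repository. `pdf_reader_options` and `read_pdf` are hypothetical helper names, not part of the library.

```python
from typing import Any, Dict


def pdf_reader_options(resolution: int = 200, pages_per_partition: int = 2) -> Dict[str, str]:
    """Build the option map for the PDF reader.

    The option names and values here are assumptions, not a confirmed API.
    """
    if resolution <= 0 or pages_per_partition <= 0:
        raise ValueError("resolution and pages_per_partition must be positive")
    return {
        "imageType": "BINARY",                         # assumed: rendered page image format
        "resolution": str(resolution),                 # assumed: DPI used when rendering pages
        "pagePerPartition": str(pages_per_partition),  # assumed: pages per Spark partition
        "reader": "pdfBox",                            # assumed: backend used to parse the PDF
    }


def read_pdf(spark: Any, path: str, **kwargs: int):
    """Return a DataFrame over the PDF file(s) at `path`.

    `spark` is expected to be an active pyspark.sql.SparkSession. Spark only
    builds a query plan here; nothing is read until an action runs.
    """
    options = pdf_reader_options(**kwargs)
    return spark.read.format("pdf").options(**options).load(path)
```

With an active SparkSession `spark`, `read_pdf(spark, "path/to/pdfs")` would return a DataFrame. Because Spark evaluates lazily, pages are only read once an action such as `show()` or `count()` runs.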
Source Code: https://github.com/StabRise/scaledp
Home page: https://stabrise.com/scaledp/
Quick Start Jupyter Notebook: https://github.com/StabRise/ScaleDP-Tutorials/blob/master/1.QuickStart.ipynb
ScaleDP is an open-source library for processing documents with Apache Spark.
- Load PDF documents and images
- Extract text from PDF documents and images
- Extract images from PDF documents
- Run OCR on images and PDF documents
- Run NER on text extracted from PDF documents and images
- Visualize NER results