MoAz06/data-processing-pipelines

Data Processing Pipelines

A collection of data-processing pipelines built using Unix shell tools and Python. These pipelines are designed to efficiently process large CSV datasets using command-line utilities combined with lightweight Python scripts.

Features

  • Data filtering and transformation using Unix tools (grep, awk, cut, sort)
  • Stream processing for large datasets (no full in-memory loading)
  • Integration of Python scripts for calculations and aggregation
  • Modular pipeline structure for reuse and extension
  • Efficient handling of compressed files (e.g. .bz2)
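
As a rough illustration of the streaming and aggregation points above, here is a minimal Python sketch that keeps one running total per key while reading CSV lines one at a time. It is not taken from the repository's scripts: the function name `sum_by_key` and the column names `category` and `amount` are hypothetical, chosen only for the example.

```python
import csv
from collections import defaultdict


def sum_by_key(lines, key_column, value_column):
    """Stream CSV lines and keep one running total per key.

    `lines` can be any iterable of text lines (for example sys.stdin),
    so only the totals are held in memory, never the whole dataset.
    """
    totals = defaultdict(float)
    reader = csv.DictReader(lines)  # the first line is treated as the header
    for row in reader:
        try:
            value = float(row[value_column])
            key = row[key_column]
        except (KeyError, TypeError, ValueError):
            # Skip rows with a missing column or a non-numeric value
            continue
        totals[key] += value
    return dict(totals)


# Small in-memory sample with hypothetical column names
sample = ["category,amount\n", "a,1.5\n", "b,2\n", "a,0.5\n", "b,oops\n"]
print(sum_by_key(sample, "category", "amount"))  # {'a': 2.0, 'b': 2.0}
```

In a pipeline stage, passing `sys.stdin` instead of the sample list would let the function consume the output of the preceding shell commands.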

Tech Stack

  • Bash / Unix Shell
  • Python
  • Core Unix utilities (grep, awk, sed, sort, uniq, cut)

Project Structure

.
├── pipeline1.sh
├── pipeline2.sh
├── pipeline3.sh
├── pipeline4.sh
├── pipeline5.py
├── pipeline6.py
└── README.md

Usage

Make shell scripts executable:

chmod +x pipeline*.sh

Run a pipeline:

./pipeline1.sh < input.csv

Example with compressed data:

bzcat dataset.csv.bz2 | ./pipeline4.sh
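
For a script that should read the compressed file itself instead of relying on `bzcat`, the standard-library `bz2` module can decompress on the fly. The sketch below shows one possible approach and is not how the repository's pipelines are implemented; the helper name `iter_bz2_rows` is hypothetical.

```python
import bz2
import csv


def iter_bz2_rows(path):
    """Yield CSV rows from a .bz2 file one at a time."""
    # Mode "rt" decompresses incrementally and decodes to text,
    # so the file is never fully loaded into memory.
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        yield from csv.reader(handle)
```

Because the function is a generator, a call such as `sum(1 for _ in iter_bz2_rows("dataset.csv.bz2"))` counts rows while holding only one row in memory at a time.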

Run Python scripts:

python3 pipeline5.py

Example

bzcat data.csv.bz2 | ./pipeline4.sh | python3 pipeline5.py

Notes

This project was developed as part of a university course (Programmeertechnieken). The focus is on building efficient, composable data pipelines using standard Unix tools.

Future Improvements

  • Add more robust error handling
  • Support additional file formats (JSON, Parquet)
  • Benchmark performance
  • Package as a reusable CLI tool

Author

Mohamed Azahrioui

Notes

This project was originally developed in a private university GitLab environment (LIACS).
The commit history here is limited because the project was later migrated to GitHub.

The focus of this repository is to showcase the final implementation and functionality.
