A collection of data-processing pipelines built with Unix shell tools and Python. The pipelines process large CSV datasets as streams, combining command-line utilities with lightweight Python scripts.
- Data filtering and transformation using Unix tools (grep, awk, cut, sort)
- Stream processing for large datasets (no full in-memory loading)
- Integration of Python scripts for calculations and aggregation
- Modular pipeline structure for reuse and extension
- Efficient handling of compressed files (e.g. .bz2)
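The streaming and compressed-file points above can be illustrated with a minimal Python sketch. It is not taken from the repository's scripts: the function name `count_matching_rows`, the column names, and the sample data are all hypothetical. It assumes a bz2-compressed CSV with a header row and reads it one row at a time, so the full file is never loaded into memory.

```python
import bz2
import csv
import os
import tempfile


def count_matching_rows(path, column, value):
    """Count rows whose `column` equals `value`, reading one row at a time."""
    matches = 0
    # bz2.open in text mode decompresses incrementally, so the whole file
    # is never held in memory at once.
    with bz2.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if row[column] == value:
                matches += 1
    return matches


# Build a tiny compressed sample file so the sketch is runnable
# (hypothetical data, not from the project).
sample = "city,temp\nLeiden,12\nDelft,15\nLeiden,9\n"
with tempfile.NamedTemporaryFile(suffix=".csv.bz2", delete=False) as tmp:
    tmp.write(bz2.compress(sample.encode("utf-8")))
    sample_path = tmp.name

leiden_rows = count_matching_rows(sample_path, "city", "Leiden")
print(leiden_rows)
os.remove(sample_path)
```

The same effect is available on the command line by decompressing with bzcat and piping the result into a filter, which is the approach the usage examples in this README take.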
- Bash / Unix Shell
- Python
- Core Unix utilities (grep, awk, sed, sort, uniq, cut)
.
├── pipeline1.sh
├── pipeline2.sh
├── pipeline3.sh
├── pipeline4.sh
├── pipeline5.py
├── pipeline6.py
└── README.md
Make shell scripts executable:
chmod +x pipeline*.sh
Run a pipeline:
./pipeline1.sh < input.csv
Example with compressed data:
bzcat dataset.csv.bz2 | ./pipeline4.sh
Run Python scripts:
python3 pipeline5.py
Chain a shell pipeline with a Python script:
bzcat data.csv.bz2 | ./pipeline4.sh | python3 pipeline5.py
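For readers wondering what a Python stage at the end of such a chain might look like, here is a hedged sketch. It does not reproduce pipeline5.py. The function `sum_by_key`, the column names, and the sample rows are assumptions made for illustration. The function accepts any iterable of CSV lines, so in a real stage it could be given `sys.stdin` directly.

```python
import csv
from collections import defaultdict


def sum_by_key(lines, key_column, value_column):
    """Sum value_column per key_column over an iterable of CSV lines."""
    totals = defaultdict(float)
    # csv.DictReader consumes the iterable lazily, one line at a time.
    for row in csv.DictReader(lines):
        totals[row[key_column]] += float(row[value_column])
    return dict(totals)


# In a real stage, `lines` would be sys.stdin so the script can sit at the
# end of a shell pipe; a small in-memory sample (hypothetical) is used here.
sample_lines = ["city,temp\n", "Leiden,12\n", "Delft,15\n", "Leiden,9\n"]
totals = sum_by_key(sample_lines, "city", "temp")
for key, total in sorted(totals.items()):
    print(f"{key},{total}")
```

Because only the running totals are kept, memory use grows with the number of distinct keys rather than with the number of input rows.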
This project was developed as part of a university course (Programmeertechnieken). The focus is on building efficient, composable data pipelines using standard Unix tools.
- Add more robust error handling
- Support additional file formats (JSON, Parquet)
- Benchmark performance
- Package as a reusable CLI tool
Mohamed Azahrioui
This project was originally developed in a private university GitLab environment (LIACS).
The commit history here is limited because the project was later migrated to GitHub.
The focus of this repository is to showcase the final implementation and functionality.