Data Processing Pipelines

A collection of data-processing pipelines built with Unix shell tools and Python. The pipelines process large CSV datasets as streams, combining command-line utilities with lightweight Python scripts.

Features

Data filtering and transformation using Unix tools (grep, awk, cut, sort)
Stream processing for large datasets (no full in-memory loading)
Integration of Python scripts for calculations and aggregation
Modular pipeline structure for reuse and extension
Efficient handling of compressed files (e.g. .bz2)
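
As a rough illustration of the streaming approach listed above, the sketch below averages one CSV column while holding only a running total and a count in memory. The function name `streaming_average` and the column name `value` are hypothetical and are not taken from the repository's scripts.

```python
import csv
import io


def streaming_average(lines, column):
    """Return the mean of a numeric column, reading one row at a time.

    `lines` is any iterable of CSV text lines (for example sys.stdin).
    Rows whose value is missing or not numeric are skipped.
    Returns None when no usable row was seen.
    """
    reader = csv.DictReader(lines)
    total = 0.0
    count = 0
    for row in reader:
        try:
            total += float(row[column])
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed rows instead of aborting the stream
        count += 1
    return total / count if count else None


# Small in-memory sample (made up for illustration); the row "c,x" is skipped.
sample = io.StringIO("name,value\na,1\nb,2\nc,x\nd,3\n")
print(streaming_average(sample, "value"))  # prints 2.0
```

Passing `sys.stdin` instead of the in-memory sample would let a script like this sit at the end of a shell pipeline, since it never needs the whole file at once.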

Tech Stack

Bash / Unix Shell
Python
Core Unix utilities (grep, awk, sed, sort, uniq, cut)

Project Structure

.
├── pipeline1.sh
├── pipeline2.sh
├── pipeline3.sh
├── pipeline4.sh
├── pipeline5.py
├── pipeline6.py
└── README.md

Usage

Make shell scripts executable:

chmod +x pipeline*.sh

Run a pipeline:

./pipeline1.sh < input.csv

Example with compressed data:

bzcat dataset.csv.bz2 | ./pipeline4.sh

Run Python scripts:

python3 pipeline5.py

Example

bzcat data.csv.bz2 | ./pipeline4.sh | python3 pipeline5.py
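
If decompression should happen inside Python rather than through `bzcat`, the standard-library `bz2` module can read a compressed file line by line. The following is a minimal sketch under that assumption; the helper `count_data_lines` and the temporary sample file are hypothetical and only serve to show the pattern.

```python
import bz2
import os
import tempfile


def count_data_lines(path):
    """Count the lines after the header in a .bz2-compressed CSV file.

    The file is decompressed incrementally, so it is never fully in memory.
    """
    with bz2.open(path, "rt", encoding="utf-8", newline="") as handle:
        next(handle, None)  # skip the header line, if there is one
        return sum(1 for _ in handle)


# Build a tiny compressed sample (made-up data) and count its data lines.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.csv.bz2")
    with bz2.open(sample_path, "wt", encoding="utf-8", newline="") as out:
        out.write("id,value\n1,10\n2,20\n")
    print(count_data_lines(sample_path))  # prints 2
```

Piping through `bzcat` keeps each stage independent of the compression format, while `bz2.open` avoids an extra process; which fits better depends on how the stages are combined.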

Notes

This project was developed as part of a university course (Programmeertechnieken). The focus is on building efficient, composable data pipelines using standard Unix tools.
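
The composability mentioned here can also be mirrored in Python with generators, where each stage consumes and yields lines much like a process in a shell pipe. The sketch below is only an analogy: the helpers `grep`, `cut`, and `uniq_count` are hypothetical, loosely modelled on the Unix tools of the same name, and the sample rows are made up.

```python
def grep(lines, needle):
    """Yield only the lines that contain `needle` (roughly `grep -F`)."""
    return (line for line in lines if needle in line)


def cut(lines, index, delimiter=","):
    """Yield one field per line (roughly `cut -d, -fN`, but zero-based)."""
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        if index < len(fields):
            yield fields[index]


def uniq_count(values):
    """Count occurrences of each value (roughly `sort | uniq -c`)."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return counts


# Made-up sample rows: country code, city, year.
sample = [
    "nl,Leiden,2021\n",
    "nl,Delft,2021\n",
    "be,Gent,2020\n",
    "nl,Leiden,2022\n",
]

# Stages are chained lazily; only uniq_count holds any aggregated state.
result = uniq_count(cut(grep(sample, "nl,"), 1))
print(result)  # prints {'Leiden': 2, 'Delft': 1}
```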

Future Improvements

Add more robust error handling
Support for additional file formats (JSON, Parquet)
Performance benchmarking
Packaging as a reusable CLI tool

Author

Mohamed Azahrioui

Notes

This project was originally developed in a private university GitLab environment (LIACS).
The commit history here is limited because the project was later migrated to GitHub.

The focus of this repository is to showcase the final implementation and functionality.
