Warning
This project is not yet considered stable for production use. While it may be suitable for experimentation and development purposes, it is not recommended for production environments. Expect potential breaking changes, bugs, and incomplete features.
This project sets up a local development environment with Apache Hadoop running on an Ubuntu Jammy image, integrated with Jupyter Notebook, using Docker. The Dockerfile and Hadoop configurations are based on the bigdatafoundation/docker-hadoop project.
- Docker
- Docker Compose
Installation instructions can be found at https://docs.docker.com/get-started/
- Clone the Repository:
  git clone https://github.com/sadra1f/pyspark-hadoop-notebook.git
- Pull and Build the Images:
  docker compose pull
  docker compose build
- Run the Services:
  docker compose up -d
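To confirm that the services came up, a quick port probe can help. The following is a minimal sketch, assuming the default host ports listed in the port table of this README; the names `SERVICE_PORTS`, `is_port_open`, and `check_services` are hypothetical helpers, not part of the project.

```python
import socket

# Host ports taken from the port table in this README (assumption: defaults unchanged).
SERVICE_PORTS = {
    "jupyter": 8888,
    "yarn-resourcemanager-ui": 8088,
    "namenode-ui": 9870,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host: str = "localhost", ports: dict = SERVICE_PORTS) -> dict:
    """Map each service label to whether its port currently accepts connections."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, is_up in check_services().items():
        print(f"{name}: {'up' if is_up else 'not reachable'}")
```

An open port only shows that something is listening; it does not prove the service behind it is healthy, so treat this as a first check rather than a full health test.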
Warning
This setup disables Jupyter's default token-based authentication for easier local development. This means the notebook server is accessible to anyone who can reach it. For production or shared environments, it is strongly recommended to re-enable token authentication by commenting out (or deleting) the `command` section in the `jupyter` service definition within the `docker-compose.yml` file.
Once the containers are running, Jupyter Notebook will be available in your web browser at http://localhost:8888. You can customize the port by changing the port mapping in the `docker-compose.yml` file under the `ports` section of the `jupyter` service.
To easily manage your notebooks and project files, use the `work` directory. This directory is mounted as a volume, ensuring that any changes you make within the Jupyter container are also reflected on your host machine, and vice versa.
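As an illustration of the shared directory, the sketch below writes a file from a notebook cell into the container path `/home/jovyan/work` (the path given in the volumes table of this README). The helper name `save_note` and the file name `hello.txt` are hypothetical.

```python
from pathlib import Path

# Container-side mount point of the work directory, per the volumes table.
DEFAULT_WORK_DIR = Path("/home/jovyan/work")


def save_note(text: str, work_dir: Path = DEFAULT_WORK_DIR, name: str = "hello.txt") -> Path:
    """Write text to a file inside the work directory and return its path."""
    work_dir.mkdir(parents=True, exist_ok=True)
    target = work_dir / name
    target.write_text(text, encoding="utf-8")
    return target
```

Calling `save_note("written from the notebook")` inside the container should make `./work/hello.txt` appear on the host, assuming the volume mapping has not been changed.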
The ports of each service can be changed in the `docker-compose.yml` file.
Service | Description | Host Port | Container Port |
---|---|---|---|
jupyter | Jupyter (PySpark) Notebook | 8888 | 8888 |
namenode | YARN ResourceManager Web UI | 8088 | 8088 |
namenode | Namenode Web UI | 9870 | 9870 |
namenode | Primary Namenode | 9000 | 9000 |
secondarynamenode | Secondary Namenode | 9868 | 9868 |
datanode-1 | First Datanode | Random | 9864 |
datanode-2 | Second Datanode | Random | 9864 |
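From a notebook, these ports can be used to reach HDFS through PySpark. The following is a rough sketch, not the project's documented workflow. It assumes that the `namenode` service name resolves inside the Compose network, that port 9000 is the HDFS endpoint listed above, and that the notebook user is allowed to write to `/tmp` in HDFS. The helper names `hdfs_uri` and `run_smoke_test` are hypothetical.

```python
def hdfs_uri(path: str, host: str = "namenode", port: int = 9000) -> str:
    """Build an hdfs:// URI for a path on the NameNode (host and port are assumptions)."""
    return f"hdfs://{host}:{port}/{path.lstrip('/')}"


def run_smoke_test(path: str = "/tmp/smoke_test") -> int:
    """Write a tiny DataFrame to HDFS, read it back, and return the row count."""
    # Imported here so the URI helper stays usable where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-smoke-test").getOrCreate()
    try:
        target = hdfs_uri(path)
        spark.range(5).write.mode("overwrite").parquet(target)
        return spark.read.parquet(target).count()
    finally:
        spark.stop()
```

Running `run_smoke_test()` in a notebook cell exercises the full path from Jupyter to the NameNode and DataNodes; if it fails, the NameNode Web UI on port 9870 is a reasonable place to start looking.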
The volumes of each service can be changed in the `docker-compose.yml` file.
Service | Description | Host Path or Volume | Container Path |
---|---|---|---|
jupyter | Work Directory | ./work | /home/jovyan/work |
namenode | Primary Namenode Data | namenode_data (Managed by Docker) | /data |
secondarynamenode | Secondary Namenode Data | secondarynamenode_data (Managed by Docker) | /data |
datanode-1 | First Datanode Data | datanode_1_data (Managed by Docker) | /data |
datanode-2 | Second Datanode Data | datanode_2_data (Managed by Docker) | /data |