Python & Big Data: Airflow & Jupyter Notebook with Hadoop 3, Spark & Presto

If you're interested in producing charts of data in Jupyter Notebook then have a look at the Visualising Data with Jupyter Notebook in SQLite blog post as it has several plotting examples using SQL that will work with both Spark and Presto.

Airflow Up & Running

The following will create a ~/airflow folder, set up a SQLite 3 database used to store Airflow's state and configuration set via the Web UI, upgrade the configuration schema and create a folder for the Python-based job code Airflow will run. A minimal example DAG that could live in that folder is sketched at the end of this section.

$ cd ~
$ airflow initdb
$ airflow upgradedb
$ mkdir -p ~/airflow/dags

By default Presto's Web UI, Spark's Web UI and Airflow's Web UI all use TCP port 8080. If you launch Presto after Spark then Presto will fail to start. If you start Spark after Presto then Presto will launch on 8080, the Spark Master Server will take 8081 and keep trying higher ports until it finds one that is free. Spark will then pick an even higher port number for the Spark Worker Web UI. This overlap normally isn't an issue as in a production setting these services would live on separate machines.

With TCP ports 8080 - 8082 taken in this installation I'm launching Airflow's Web UI on port 8083.

$ airflow webserver --port=8083 &

I often use one of the following commands to see which networking ports are in use.

$ sudo lsof -OnP | grep LISTEN
$ netstat -tuplen
$ ss -lntu

Airflow's default configuration expects MySQL to back both the Celery broker and the results backend. The following will change these over to RabbitMQ instead.

$ vi ~/airflow/airflow.cfg

Locate and edit the following settings.
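As a sketch, assuming a RabbitMQ user, password and vhost all named airflow running on localhost (these values are assumptions, not taken from the original post), the Celery broker and results backend settings would look something like this:

broker_url = amqp://airflow:airflow@localhost:5672/airflow
celery_result_backend = amqp://airflow:airflow@localhost:5672/airflow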
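To confirm the ~/airflow/dags folder created earlier is being picked up, here is a minimal sketch of a DAG that runs a single Bash task once a day. The DAG id hello_world, its schedule and its task are illustrative placeholders; save the file as ~/airflow/dags/hello_world.py and it should appear in Airflow's Web UI.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# The DAG id and daily schedule below are placeholder values.
dag = DAG('hello_world',
          default_args=default_args,
          schedule_interval='@daily')

# A single task that prints the current date via Bash.
print_date = BashOperator(task_id='print_date',
                          bash_command='date',
                          dag=dag)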
