Installation setup

The setup below follows the instructions given in GitHub issue 6301:

%pip install "tornado>=5" 
%pip install "dask[complete]"
Requirement already satisfied: tornado>=5 in /usr/local/lib/python3.6/dist-packages (6.0.4)
Requirement already satisfied: dask[complete] in /usr/local/lib/python3.6/dist-packages (2.12.0)
Requirement already satisfied: numpy>=1.13.0; extra == "complete" in /usr/local/lib/python3.6/dist-packages (from dask[complete]) (1.18.5)
Requirement already satisfied: bokeh>=1.0.0; extra == "complete" in /usr/local/lib/python3.6/dist-packages (from dask[complete]) (1.4.0)
Requirement already satisfied: partd>=0.3.10; extra == "complete" in /usr/local/lib/python3.6/dist-packages (from dask[complete]) (1.1.0)
Requirement already satisfied: toolz>=0.7.3; extra == "complete" in /usr/local/lib/python3.6/dist-packages (from dask[complete]) (0.10.0)
Requirement already satisfied: distributed>=2.0; extra == "complete" in /usr/local/lib/python3.6/dist-packages (from dask[complete]) (2.18.0)
Requirement already satisfied: cloudpickle>=0.2.1; extra == "complete" in /usr/local/lib/python3.6/dist-packages (from dask[complete]) (1.3.0)
Requirement already satisfied: PyYaml; extra == "complete" in /usr/local/lib/python3.6/dist-packages (from dask[complete]) (3.13)
Requirement already satisfied: pandas>=0.23.0; extra == "complete" in /usr/local/lib/python3.6/dist-packages (from dask[complete]) (1.0.4)
Requirement already satisfied: fsspec>=0.6.0; extra == "complete" in /usr/local/lib/python3.6/dist-packages (from dask[complete]) (0.7.4)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from bokeh>=1.0.0; extra == "complete"->dask[complete]) (2.8.1)
Requirement already satisfied: packaging>=16.8 in /usr/local/lib/python3.6/dist-packages (from bokeh>=1.0.0; extra == "complete"->dask[complete]) (20.4)
Requirement already satisfied: tornado>=4.3 in /usr/local/lib/python3.6/dist-packages (from bokeh>=1.0.0; extra == "complete"->dask[complete]) (6.0.4)
Requirement already satisfied: Jinja2>=2.7 in /usr/local/lib/python3.6/dist-packages (from bokeh>=1.0.0; extra == "complete"->dask[complete]) (2.11.2)
Requirement already satisfied: pillow>=4.0 in /usr/local/lib/python3.6/dist-packages (from bokeh>=1.0.0; extra == "complete"->dask[complete]) (7.0.0)
Requirement already satisfied: six>=1.5.2 in /usr/local/lib/python3.6/dist-packages (from bokeh>=1.0.0; extra == "complete"->dask[complete]) (1.12.0)
Requirement already satisfied: locket in /usr/local/lib/python3.6/dist-packages (from partd>=0.3.10; extra == "complete"->dask[complete]) (0.2.0)
Requirement already satisfied: psutil>=5.0 in /usr/local/lib/python3.6/dist-packages (from distributed>=2.0; extra == "complete"->dask[complete]) (5.4.8)
Requirement already satisfied: setuptools in /usr/local/lib/python3.6/dist-packages (from distributed>=2.0; extra == "complete"->dask[complete]) (47.1.1)
Requirement already satisfied: contextvars; python_version < "3.7" in /usr/local/lib/python3.6/dist-packages (from distributed>=2.0; extra == "complete"->dask[complete]) (2.4)
Requirement already satisfied: click>=6.6 in /usr/local/lib/python3.6/dist-packages (from distributed>=2.0; extra == "complete"->dask[complete]) (7.1.2)
Requirement already satisfied: msgpack>=0.6.0 in /usr/local/lib/python3.6/dist-packages (from distributed>=2.0; extra == "complete"->dask[complete]) (1.0.0)
Requirement already satisfied: tblib>=1.6.0 in /usr/local/lib/python3.6/dist-packages (from distributed>=2.0; extra == "complete"->dask[complete]) (1.6.0)
Requirement already satisfied: zict>=0.1.3 in /usr/local/lib/python3.6/dist-packages (from distributed>=2.0; extra == "complete"->dask[complete]) (2.0.0)
Requirement already satisfied: sortedcontainers!=2.0.0,!=2.0.1 in /usr/local/lib/python3.6/dist-packages (from distributed>=2.0; extra == "complete"->dask[complete]) (2.1.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.23.0; extra == "complete"->dask[complete]) (2018.9)
Requirement already satisfied: pyparsing>=2.0.2 in /usr/local/lib/python3.6/dist-packages (from packaging>=16.8->bokeh>=1.0.0; extra == "complete"->dask[complete]) (2.4.7)
Requirement already satisfied: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from Jinja2>=2.7->bokeh>=1.0.0; extra == "complete"->dask[complete]) (1.1.1)
Requirement already satisfied: immutables>=0.9 in /usr/local/lib/python3.6/dist-packages (from contextvars; python_version < "3.7"->distributed>=2.0; extra == "complete"->dask[complete]) (0.14)
Requirement already satisfied: heapdict in /usr/local/lib/python3.6/dist-packages (from zict>=0.1.3->distributed>=2.0; extra == "complete"->dask[complete]) (1.0.1)

Dummy data upload

Data source: a CSV file hosted on Amazon S3.

!wget -O 'dataset.csv' 'https://e-commerce-data.s3.amazonaws.com/E-commerce+Data+(1).csv'
--2020-06-14 05:43:42--  https://e-commerce-data.s3.amazonaws.com/E-commerce+Data+(1).csv
Resolving e-commerce-data.s3.amazonaws.com (e-commerce-data.s3.amazonaws.com)... 52.216.89.164
Connecting to e-commerce-data.s3.amazonaws.com (e-commerce-data.s3.amazonaws.com)|52.216.89.164|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 45580638 (43M) [text/csv]
Saving to: ‘dataset.csv’

dataset.csv         100%[===================>]  43.47M  11.8MB/s    in 3.9s    

2020-06-14 05:43:47 (11.2 MB/s) - ‘dataset.csv’ saved [45580638/45580638]
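
Outside a notebook, where the `!wget` shell escape is not available, the same download can be done from Python. The following is a minimal sketch using only the standard library; the helper name `download` is hypothetical and not part of the original notebook.

```python
import os
import shutil
import urllib.request


def download(url, dest_path):
    """Stream the resource at `url` into `dest_path` and return its size in bytes."""
    # copyfileobj copies in chunks, so the whole file is never held in memory
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return os.path.getsize(dest_path)
```

For the dataset used here, the call would be `download('https://e-commerce-data.s3.amazonaws.com/E-commerce+Data+(1).csv', 'dataset.csv')`, which should report the same 45580638 bytes that wget shows above, assuming the file is still hosted unchanged.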

Pandas Performance

Read the dataset using pd.read_csv()

import pandas as pd
%time temp = pd.read_csv('dataset.csv', encoding = 'ISO-8859-1')
CPU times: user 619 ms, sys: 73.6 ms, total: 692 ms
Wall time: 705 ms
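
The `%time` magic only exists inside IPython and Jupyter. As a rough sketch of how similar numbers could be collected in a plain Python script, the hypothetical helper `time_call` below measures CPU time (user plus system, as one total) and wall time around a single call. The workload is a stand-in; in this post it would be the `pd.read_csv` call.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, cpu_seconds, wall_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system CPU time of this process
    result = fn(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, cpu_seconds, wall_seconds


# Stand-in workload, used here only so the sketch runs without the dataset
result, cpu_seconds, wall_seconds = time_call(sum, range(1_000_000))
print(f"CPU time: {cpu_seconds * 1e3:.1f} ms, wall time: {wall_seconds * 1e3:.1f} ms")
```

With pandas imported, the equivalent of the cell above would be `time_call(pd.read_csv, 'dataset.csv', encoding='ISO-8859-1')`.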

Dask Setup

  1. Setting up the Dask standalone cluster
  2. Reading the data with a Dask DataFrame

Setting up standalone cluster

from dask.distributed import Client, progress
client = Client(processes=False, threads_per_worker=4,
                n_workers=1, memory_limit='2GB')
client

Client

Cluster

  • Workers: 1
  • Cores: 4
  • Memory: 2.00 GB
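
To check that a client like the one above actually executes work, a few small tasks can be submitted to it. This is a minimal sketch, not part of the original notebook: the `square` function is hypothetical, and `dashboard_address=None` is an assumption added here to skip starting the diagnostics dashboard.

```python
from dask.distributed import Client


def square(x):
    return x * x


# Same in-process, single-worker layout as above; the context manager
# closes the client and its local cluster when the block exits.
with Client(processes=False, n_workers=1, threads_per_worker=4,
            memory_limit="2GB", dashboard_address=None) as client:
    futures = client.map(square, range(5))        # one task per input value
    squares = client.gather(futures)              # collect results as a list
    total = client.submit(sum, squares).result()  # run a single follow-up task

print(squares, total)
```

Because `processes=False` keeps the worker in the same process as the notebook, the tasks run on threads and no data is serialized between processes.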

Dask Performance

import dask.dataframe as dd
%time df = dd.read_csv("dataset.csv", encoding = 'ISO-8859-1')
CPU times: user 21.7 ms, sys: 938 µs, total: 22.7 ms
Wall time: 23.2 ms
