How is deep learning on Amazon EC2 awful this week?

March 16, 2017

computers are awful
concurrency hell
premature optimization

I want to do cloud machine learning.

Let’s try this on Amazon Web Services and see what’s awful.

Yak shaving risk.

I don’t want to do anything fancy here, just process a few gigabytes of MP3 data. My data is stored in the AARNET ownCloud server. It’s all quite simple, but the algorithm is just too slow without a GPU, and I don’t have a GPU machine I can leave running. I’ve developed it in Keras v1.2.2, which depends on TensorFlow 1.0.

I was trying to use Google for this, but I got lost in working out their big-data-optimised algorithms and then discovered they weren’t even going to save me any money over Amazon, so I may as well just take the easy route and do some Amazon thing. Gimme a fancy computer with no fuss please, Amazon. Let me run my TensorFlow.

1 Preliminaries

Howto guide from Bitfusion, and the Keras run-through.

If you want to upload or config locally, you should probably get the AWS CLI.

pip3 install awscli
aws configure

You will need to set a password to use X11 GUIs.
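What “set a password” means presumably depends on the AMI; a hedged guess, assuming the stock `ubuntu` user and plain ssh X11 forwarding (the key path and hostname below are placeholders):

```bash
# on the instance: give the default user a password
sudo passwd ubuntu
# from the laptop: connect with X11 forwarding enabled
ssh -X -i ~/.ssh/mykey.pem ubuntu@ec2-203-0-113-25.compute-1.amazonaws.com
```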

2 Attempt 1: Ubuntu 14.04

I will use the elderly and unloved Ubuntu NVIDIA images, since they support ownCloud.

First, we fire up tmux to persist jobs between network implosions.
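For example (the session name is arbitrary):

```bash
tmux new -s lg      # start a named session and run the long job inside it
# ...network implodes, ssh connection dies...
tmux attach -t lg   # reattach from a fresh connection; the job kept running
```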

Now, install some necessary things:

sudo apt install virtualenvwrapper  # build my own updated Python
sudo apt install owncloud-client-cmd   # sync my files
sudo apt install libcupti-dev # recommended CUDA tools

Great. That all works.

owncloudcmd -u vice.chancellor@unsw.edu.au -p password1234 ~/Datasets https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/Datasets

Oh, that segfaults. So perhaps they don’t support ownCloud. Bugger it, I’ll download my data manually. Let me at the actual calculations.

wget -r -nH -np --cut-dirs=1 -U Mozilla --user=vice.chancellor@unsw.edu.au --password=password1234 https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/Datasets

Huh, a 401 error. Hmm.
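If I were feeling diligent I’d interrogate the WebDAV endpoint directly to see whether it’s the credentials or the path being rejected; curl speaks enough WebDAV for that. (ownCloud servers often want an app-specific sync password for WebDAV rather than your institutional one, which may be the real problem.)

```bash
# a bare PROPFIND against the WebDAV root; a 401 here means the credentials
# themselves are rejected, not the Datasets path
curl -u vice.chancellor@unsw.edu.au:password1234 \
     -X PROPFIND -H "Depth: 1" \
     https://cloudstor.aarnet.edu.au/plus/remote.php/webdav/
```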

Well, I’ll rsync from my laptop. While that’s happening, I’ll upgrade TensorFlow.
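The laptop side is roughly this (hostname and key path are placeholders):

```bash
# push the dataset up over plain ssh; no -z, since MP3s barely compress
rsync -av --progress -e "ssh -i ~/.ssh/mykey.pem" \
      ~/Datasets/ ubuntu@ec2-203-0-113-25.compute-1.amazonaws.com:~/Datasets/
```

Meanwhile, the TensorFlow upgrade: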

pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.0.1-cp34-cp34m-linux_x86_64.whl

Oh no, turns out the shipped NVIDIA libs are too old for TensorFlow 1.0 (CUDA 7.5, where 8.0 is required). GOTO NVIDIA’s CUDA page, and embark upon a complicated install procedure. Oh wait, I need to register first.
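For reference, a quick way to see which CUDA toolkit an image actually ships, assuming the standard install locations:

```bash
nvcc --version                    # reports the toolkit release
cat /usr/local/cuda/version.txt   # ditto, without needing nvcc on the PATH
```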

<much downloading drivers and running mysterious install scripts omitted, after which it seems to claim to work.>

Oh, it’s missing ffmpeg. How about I fix that with some completely unverified packages from some guy on the internet? I could have compiled it myself, I guess?

sudo add-apt-repository ppa:mc3man/trusty-media
sudo apt install ffmpeg

Now I run my code.

Well, that bit kinda worked, except that now my TensorFlow instance can’t see the video drivers at all. There’s no error, it just doesn’t see the GPU.
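A quick way to check what TensorFlow can see, using nothing beyond stock TensorFlow; a working GPU build should list a `/gpu:0` device:

```bash
# a CPU-only list here means the CUDA libraries are not being found/loaded
python3 -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"
```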

So I’m paying money for no reason; this calculation in fact goes slightly faster on my laptop, for which I pay only the cost of electricity.

Bugger it, I’ll try to use the NVIDIA-supported AMI. That will be sweet, right?

nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   32C    P0    35W / 125W |     11MiB /  4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Summary: this turned out to be a terrible idea, because the image doesn’t actually ship the up-to-date GPU libraries that TensorFlow 1.0 needs. I still had to join NVIDIA’s development program and download gigabytes of crap. Then I broke it. If you are going to do that, you may as well just go to Ubuntu 16.04 and at least have modern libraries. Or Amazon Linux, see below.

3 Attempt 2: Amazon Linux AMI

Firstly, we need tmux for persistent jobs.

sudo yum install tmux
tmux ls
failed to connect to server
tmux new lg
[exited]

Ah, so tmux doesn’t work on Amazon Linux? Maybe they have no users with persistent remote jobs?

Uhhh. OK, well I’ll ignore that for now and install ffmpeg to analyse those MP3s.

sudo yum install ffmpeg
No package ffmpeg available.

Arse.

The forums recommend downloading some guy’s ffmpeg builds (extra debugging info here). Or maybe you can install it from a repo?

ARGH my session just froze and I can’t resume it because I have no tmux. Bugger this for a game of soldiers.
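(In hindsight, a crude fallback when tmux is broken is nohup; `train.py` here is a stand-in for whatever the long job is:)

```bash
# the job survives the ssh session dying; output lands in a log file
nohup python3 train.py > train.log 2>&1 &
tail -f train.log   # peek at progress after reconnecting
```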

4 Attempt 3: Ubuntu 16.04

We start with a recent Ubuntu AMI. Unfortunately I’m not allowed to run that on GPU instances.

5 Attempt 4: The original Ubuntu 14.04 image but I’ll take a deep breath and do the GPU driver install properly

Back to the elderly and unloved Ubuntu NVIDIA images.

Maybe I can do a cheeky upgrade?

sudo do-release-upgrade

No, that’s too terrifying.

OK, careful probing reveals that the Amazon G2 instances have NVIDIA GRID K520 GPUs. NVIDIA doesn’t list them on their main overview page, but careful searching will turn up a link to a driver numbered 367.57, so I’m probably looking for a driver number like that. And “compute capability” 3.0, I learnt from internet forums.
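If you’d rather not trust forums, the CUDA samples ship a `deviceQuery` tool that reports compute capability directly, assuming the samples are installed in the default spot:

```bash
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery | grep -i capability
```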

This is getting silly.

Hmm, maybe I can hope my code is TensorFlow 0.12 compatible?

sudo apt install python-pip python-dev python-virtualenv virtualenvwrapper
sudo apt install python3-pip python3-dev python3-virtualenv
virtualenv --system-site-packages ~/lg_virtualenv/ --python=`which python3`
source ~/lg_virtualenv/bin/activate
~/lg_virtualenv/bin/pip install --upgrade pip  # otherwise, weird install errors
~/lg_virtualenv/bin/pip install audioread librosa jupyter  # I think this will be fine for my app?
jupyter notebook --port=9188 workbooks

Oh crap. Turns out the version of scipy in this virtualenv is arbitrarily broken and won’t import:

from scipy.stats import poisson, geom, expon
ImportError: No module named 'scipy.lib.decorator'

What? OK, that looks like some obsolete version of scipy.

~/lg_virtualenv/bin/pip install --upgrade scipy

AAAAAAAAND now TensorFlow is broken, because the scipy upgrade broke numpy, and I get `RuntimeError: module compiled against API version 0xa but this version of numpy is 0x9`.

OK, let’s see if I can get my virtualenv to use everything compiled from the parent distro, which will require me to work out how to set up Jupyter to use a virtualenv kernel:

Instructions here:

wget https://developer.nvidia.com/compute/cuda/8.0/Prod2/local_installers/cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64-deb
sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64.deb
sudo dpkg -i libcudnn5-dev_5.1.10-1+cuda8.0_amd64.deb
sudo apt install cuda
sudo apt install libcupti-dev
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt --no-install-recommends install nvidia-375
sudo apt update && sudo apt install bazel
sudo apt install python3-pip python3-dev python3-virtualenv
sudo apt install ffmpeg owncloud-client-cmd  # finally.
sudo pip3 install jupyter librosa pydot_ng audioread numpy scipy seaborn keras==1.2.2
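(For the record, registering a virtualenv as a Jupyter kernel is the standard ipykernel dance, something like this for the virtualenv from earlier:)

```bash
source ~/lg_virtualenv/bin/activate
pip install ipykernel
python -m ipykernel install --user --name=lg_virtualenv
```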


I put this stuff in `~/.profile`:

```bash
export CUDA_HOME=/usr/local/cuda
export CUDA_ROOT=/usr/local/cuda
export PATH=$PATH:$CUDA_ROOT/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_ROOT/lib64
```

Build time.

For the ./configure step, I need to know where everything landed: cudnn ended up in /usr/lib/x86_64-linux-gnu, cuda in /usr/local/cuda, and the python library path somehow ended up in /usr/local/lib/python3.5/dist-packages. The compute capability of the K80 is 3.7; if you want to use the G2 instances as well, it might run if you also generate the 3.0 version, although I haven’t tested that.
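./configure is interactive, but (at least in the TF 1.x era) it consults environment variables, so you can pre-seed the answers; roughly, with the paths as they landed on this box:

```bash
export PYTHON_BIN_PATH=/usr/bin/python3
export TF_NEED_CUDA=1
export CUDA_TOOLKIT_PATH=/usr/local/cuda
export CUDNN_INSTALL_PATH=/usr/lib/x86_64-linux-gnu
export TF_CUDA_COMPUTE_CAPABILITIES="3.7,3.0"   # K80, plus the untested K520
```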

git clone --recurse-submodules https://github.com/tensorflow/tensorflow
cd tensorflow
./configure
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
# Go and make a coffee, or perhaps a 3-course meal, because this takes an hour

bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/tensorflow_pkg
sudo pip3 install ~/tensorflow_pkg/tensorflow-1.0.1-cp35-cp35m-linux_x86_64.whl

AAAAAH IT RUNS!

My total bill for this awful experience was

  • 37.28 USD, and
  • approximately 32 hours of work, including the 10-odd hours I pissed against the wall trying Google’s cloud.

Now, hopefully my algorithm does something interesting.

Addendum: I couldn’t make owncloud authenticate and I’m bored of that, so I uploaded the results into an S3 bucket.

The magic IAM policy for that is:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::bucket_of_data"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::bucket_of_data/*"
            ]
        }
    ]
}

You can use this to do file sync via such commands as

aws s3 sync --exclude="/.*" ./output s3://bucket_of_data/output  # upload
aws s3 sync --exclude="/.*" s3://bucket_of_data/output ./output  # download

However, this doesn’t delete files at the destination by default (there is a --delete flag), so keeping two trees genuinely in sync is annoying. I will probably need to manage that with git-annex or rclone. See synchronising files.
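For what it’s worth, rclone’s sync verb really does mirror deletions; a sketch, assuming a remote named `s3remote` configured beforehand via `rclone config`:

```bash
# mirror local output to the bucket, deleting remote files that vanished locally
rclone sync ./output s3remote:bucket_of_data/output
```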