One interesting aspect of the deep learning ecosystem is the sheer number of frameworks to choose from. The flip side is that more options bring more confusion, especially when trying to pick a single framework that suits the whole gamut of problems. In practice, rather than settling on one, we may end up using several deep learning frameworks, choosing each according to the nature of the problem at hand.
TensorFlow is one of the most popular deep learning frameworks (the de facto leader in terms of GitHub stars). It comes with excellent documentation, including installation documentation: the official installation page provides an elaborate guide for multiple OS platforms. So why this post?
The latest version of TensorFlow with GPU support (version 1.8 at the time this post is published) is built against CUDA 9.0. However, NVIDIA has already released CUDA 9.1, and newer versions may follow in the near future. Because TensorFlow lags behind the CUDA GA version, the publicly released TensorFlow bundle does not work out of the box on a system that has only the latest CUDA version installed. The remedy is to build TensorFlow from source, which can be non-trivial, especially for those who are not familiar with the source build mechanism.
The final system setup after completing the installation steps explained in this post will be as follows.
| Item | Value | 
|---|---|
| OS | Ubuntu 16.04 | 
| NVIDIA driver version | 390.48 | 
| CUDA version | 9.1 | 
| cuDNN version | 7.1.3 | 
| NCCL version | 2.1.15 | 
| Python version | 2.7.12 | 
| Python install method | virtualenv | 
| TensorFlow version | 1.8.0 | 
Note that these components will be updated in the future, which implies version upgrades. This post is expected to remain valid after such upgrades. If it becomes invalid, the content will be updated or another post will be written, provided there are sufficient comments or feedback regarding the existing content.
Python 3 users may wonder whether the steps can be replicated for Python 3. The answer is yes, but not verbatim. It is recommended to perform the installation in a separate virtualenv dedicated to Python 3, using the Python 3 equivalent of each command.
Pre Installation
Prior to installation, we will run some system checks for the prerequisites.
Pre 0: Update apt and install the latest packages
$ sudo apt-get -y update && sudo apt-get -y upgrade
Pre 1: Check Python version
$ python -V
Output (expected): Python 2.7.12 (or other 2.7 version)
Pre 2: Check GCC version
$ gcc --version
Output (expected): 5.4.0 (or other 5.x version)
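The GCC major version matters again in step 9.5, where GCC 5 and newer need an extra ABI flag on the bazel command line. As a rough sketch only (the helper names `parse_gcc_major`, `gcc_major_version` and `bazel_build_command` are made up for this illustration, and it uses `gcc -dumpversion` rather than `gcc --version` because the output is easier to parse), the decision could be scripted as follows. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

import subprocess


def parse_gcc_major(dumpversion_output):
    """Return the major version from `gcc -dumpversion` output such as '5.4.0'."""
    return int(dumpversion_output.strip().split(".")[0])


def gcc_major_version():
    """Ask the installed gcc for its version; return None if gcc is unavailable."""
    try:
        out = subprocess.check_output(["gcc", "-dumpversion"])
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_gcc_major(out.decode())


def bazel_build_command(gcc_major):
    """Compose the bazel command from step 9.5 for the given GCC major version."""
    cmd = ["bazel", "build", "--config=opt", "--config=cuda"]
    if gcc_major >= 5:
        # GCC 5+ defaults to the new libstdc++ ABI; this flag selects the old one.
        cmd.append("--cxxopt=-D_GLIBCXX_USE_CXX11_ABI=0")
    cmd.append("//tensorflow/tools/pip_package:build_pip_package")
    return cmd


if __name__ == "__main__":
    major = gcc_major_version()
    if major is None:
        print("gcc was not found on this system")
    else:
        print("GCC major version:", major)
        print(" ".join(bazel_build_command(major)))
```

The script only prints the command; running the build itself is still left to step 9.5.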
Pre 3: Check NVIDIA driver install status
$ nvidia-smi
Output (expected): GPU information
Troubleshooting:
bash: nvidia-smi: command not found
Meaning: Graphics driver has not been installed
Resolve: Install NVIDIA graphics driver. Refer to this article for the installation steps.
Pre 4: Check CUDA install status
$ nvcc --version
Output (expected): “… Cuda compilation tools, release 9.1 …”
Troubleshooting:
bash: nvcc: command not found
Meaning: CUDA toolkit has not been installed
Resolve: Install CUDA toolkit. Refer to this article for the installation steps.
Pre 5: Check cuDNN install status
$ locate libcudnn.so
Output (expected): /usr/lib/x86_64-linux-gnu/libcudnn.so
Troubleshooting:
Empty result returned
Meaning: cuDNN has not been installed.
Resolve: Install cuDNN. Refer to this article for the installation steps
Pre 6: Check NCCL install status
$ locate libnccl.so
Output (expected): /usr/local/nccl-2.1/lib/libnccl.so
Troubleshooting:
Empty result returned
Meaning: NCCL has not been installed.
Resolve: Install NCCL. Refer to this article for the installation steps
Please note that even though NCCL is optional in a TensorFlow installation, we set it as a prerequisite in order to harness GPU parallelism in future use.
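The checks in Pre 3 to Pre 6 can also be run in one go. The sketch below is illustrative only: the function names are invented here, and the library sonames (`libcudart.so.9.1`, `libcudnn.so.7`, `libnccl.so.2`) are assumptions derived from the versions listed in this post. Note one difference from `locate`: `locate` confirms that a file exists on disk, whereas `ctypes.CDLL` checks that the dynamic loader can actually open it. A library in a non-standard directory such as /usr/local/nccl-2.1/lib may therefore be reported as missing until that directory is on the loader path.

```python
from __future__ import print_function

import ctypes
import os
import subprocess

# Commands taken from the checks above; the sonames are assumptions based on
# the versions used in this post (CUDA 9.1, cuDNN 7, NCCL 2).
COMMANDS = [["nvidia-smi"], ["nvcc", "--version"], ["gcc", "--version"]]
LIBRARIES = ["libcudart.so.9.1", "libcudnn.so.7", "libnccl.so.2"]


def command_works(argv):
    """Return True if the command can be started and exits with status 0."""
    try:
        with open(os.devnull, "w") as devnull:
            subprocess.check_call(argv, stdout=devnull, stderr=devnull)
    except (OSError, subprocess.CalledProcessError):
        return False
    return True


def library_loads(soname):
    """Return True if the dynamic loader can find and open the library."""
    try:
        ctypes.CDLL(soname)
    except OSError:
        return False
    return True


def missing_prerequisites(commands=COMMANDS, libraries=LIBRARIES):
    """List the commands and libraries that failed their check."""
    missing = [argv[0] for argv in commands if not command_works(argv)]
    missing += [lib for lib in libraries if not library_loads(lib)]
    return missing


if __name__ == "__main__":
    missing = missing_prerequisites()
    if missing:
        print("Missing or not loadable:", ", ".join(missing))
    else:
        print("All checked prerequisites were found")
```

Anything reported as missing maps back to the corresponding "Resolve" entry above.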
Pre 7: Check CUDA profiling tools install status
$ dpkg --get-selections | grep cuda-command-line-tools
Output (expected): “cuda-command-line-tools-9-1 install”
Troubleshooting:
Empty result returned or status is not “install”
Meaning: CUDA profiling tools library has not been installed.
Resolve: Install the library
Steps to install CUDA profiling tools
7.1 Install CUDA profiling tools using apt
$ sudo apt-get install cuda-command-line-tools-9-1
7.2 Add the profiling tools library to LD_LIBRARY_PATH
$ vi ~/.profile
...
LD_LIBRARY_PATH="/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda-9.1/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
...
Installation Steps
After confirming that all checks in the pre-installation phase pass, we will proceed with the TensorFlow installation. We will create the virtualenv at ~/virtualenv/tensorflow.
Step 1: Set locale to UTF-8
$ export LC_ALL="en_US.UTF-8"
$ export LC_CTYPE="en_US.UTF-8"
$ sudo dpkg-reconfigure locales
Step 2: Install pip and virtualenv for Python 2
$ sudo apt-get install python-pip python-dev python-virtualenv
Step 3: Create virtualenv environment for Python 2 (Virtualenv location: ~/virtualenv/tensorflow)
$ mkdir -p ~/virtualenv/tensorflow
$ virtualenv --system-site-packages ~/virtualenv/tensorflow
Step 4: Activate the virtualenv environment
$ source ~/virtualenv/tensorflow/bin/activate
Verify the prompt is changed to:
(tensorflow) $
Step 5: (virtualenv) Ensure pip >= 8.1 is installed
(tensorflow) $ easy_install -U pip
Step 6: (virtualenv) Deactivate the virtualenv
(tensorflow) $ deactivate
Step 7: Install bazel to build TensorFlow from source
7.1 Install Java JDK 8 (Open JDK)
$ sudo apt-get install openjdk-8-jdk
7.2 Add bazel private repository into source repository list
$ echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
$ curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
7.3 Install the latest version of bazel
$ sudo apt-get update && sudo apt-get install bazel
Step 8: Install TensorFlow Python dependencies
$ sudo apt-get install python-numpy python-dev python-pip python-wheel
Step 9: Build TensorFlow from source
9.1 Download the latest stable release of TensorFlow (release 1.8.0). We set target directory to ~/installers/tensorflow
$ mkdir -p ~/installers/tensorflow && cd ~/installers/tensorflow
$ wget https://github.com/tensorflow/tensorflow/archive/v1.8.0.zip
9.2 Unzip the installer
$ unzip v1.8.0
9.3 Go to the extracted TensorFlow source directory
$ cd tensorflow-1.8.0
9.4 Configure the build file
$ ./configure
Sample configuration (note that other than the CUDA configuration, your configuration may vary):
Please input the desired Python library path to use.  Default is [/usr/local/lib/python2.7/dist-packages] 
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: Y
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: Y
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: Y
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: Y
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: Y
Do you wish to build TensorFlow with XLA JIT support? [y/N]: N
Do you wish to build TensorFlow with GDR support? [y/N]: N
Do you wish to build TensorFlow with VERBS support? [y/N]: N
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
Do you wish to build TensorFlow with CUDA support? [y/N]: Y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.1
Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /usr/local/cuda-9.1
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.1.3
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-9.1]:/usr/lib/x86_64-linux-gnu
Do you wish to build TensorFlow with TensorRT support? [y/N]: N
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 1.3]:2.1.15
Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda-9.1]: /usr/local/nccl-2.1
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.7,3.7]3.7
Do you want to use clang as CUDA compiler? [y/N]: N
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]: /usr/bin/gcc
Do you wish to build TensorFlow with MPI support? [y/N]: N
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: -march=native
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: N
9.5 Build TensorFlow with GPU support
Since the GCC version is >= 5.0:
$ bazel build --config=opt --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow/tools/pip_package:build_pip_package
For GCC version 4 and older:
$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
Troubleshooting:
ImportError: No module named enum
Meaning: The enum module is part of the standard library only from Python 3.4 onward.
Resolve: Manually install the enum module
$ pip install --upgrade enum34
ImportError: No module named mock
Meaning: The mock module is not installed
Resolve: Install the mock module
$ pip install mock
no such package @nasm...All mirrors are down: [java.lang.RuntimeException: Could not generate DH keypair, sun.security.validator.ValidatorException: End user tried to act as a CA]
Meaning: One or all of the nasm package mirrors are down. This is a known issue.
Resolve: Add a new mirror for nasm
$ vi tensorflow/workspace.bzl
...
  tf_http_archive(
      name = "nasm",
      urls = [
          "https://mirror.bazel.build/www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
          "http://www.nasm.us/pub/nasm/releasebuilds/2.12.02/nasm-2.12.02.tar.bz2",
          "http://pkgs.fedoraproject.org/repo/pkgs/nasm/nasm-2.12.02.tar.bz2/d15843c3fb7db39af80571ee27ec6fad/nasm-2.12.02.tar.bz2",
      ],
      sha256 = "00b0891c678c065446ca59bcee64719d0096d54d6886e6e472aeee2e170ae324",
      strip_prefix = "nasm-2.12.02",
      build_file = clean_dep("//third_party:nasm.BUILD"),
  )
...
9.6 Create the .whl file from the bazel build (We create a directory named tensorflow-pkg)
$ mkdir tensorflow-pkg
$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package tensorflow-pkg
9.7 Activate the virtualenv
$ source ~/virtualenv/tensorflow/bin/activate
9.8 (virtualenv) Install the .whl file
– Obtain the .whl file name
(tensorflow) $ cd tensorflow-pkg && ls -al
– Install the .whl file via pip (example)
(tensorflow) $ pip install tensorflow-1.8.0-cp27-cp27mu-linux_x86_64.whl
9.9 (virtualenv) Verify the installation
(tensorflow) $ python
import tensorflow as tf
hello = tf.string_join(['Hello', 'TensorFlow!'], ' ')
sess = tf.Session()
print(sess.run(hello))
Output after the last line:
Hello TensorFlow!
The installation is now complete. We can now use TensorFlow on the system. Don’t forget that since TensorFlow is installed in a virtualenv, the virtualenv must be activated whenever we run a TensorFlow program.
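Since a forgotten virtualenv activation is an easy mistake to make, a small check at the top of a program can help. The following is a minimal sketch, not part of the official TensorFlow tooling: `in_virtualenv` and `visible_gpus` are names invented for this example. It relies on the `device_lib.list_local_devices()` helper from `tensorflow.python.client` and returns `None` if TensorFlow cannot be imported from the current interpreter.

```python
from __future__ import print_function

import sys


def in_virtualenv():
    """Detect an active virtualenv (classic virtualenv or Python 3 venv)."""
    # Classic virtualenv sets sys.real_prefix; venv makes base_prefix differ.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def visible_gpus():
    """Return the GPU device names TensorFlow sees, or None without TensorFlow."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    devices = device_lib.list_local_devices()
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not importable from this interpreter")
    elif gpus:
        print("GPUs visible to TensorFlow:", ", ".join(gpus))
    else:
        print("TensorFlow imported, but no GPU device was found")
```

An empty GPU list after a successful import usually suggests the build or the driver setup needs another look, rather than the virtualenv.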
Closing Remark
Making TensorFlow work with CUDA 9.1 should not be too daunting. If you have a problem with your TensorFlow and CUDA 9.1 installation, or tips and tricks to share, feel free to write in the comment section.
Hi, I can compile by following these instructions, but I am having an issue while running. Specifically this: NotFoundError: libnccl.so.2: cannot open shared object file: No such file or directory
I installed into the main Ubuntu Python directory and am running a Jupyter notebook. (I did try with virtualenv; it didn't work, so I ditched that to reduce complexity.)
The full error is as follows:
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
in ()
6 training_targets=training_targets,
7 validation_examples=validation_examples,
----> 8 validation_targets=validation_targets)
in train_linear_classification_model(learning_rate, steps, batch_size, training_examples, training_targets, validation_examples, validation_targets)
40 # Create a LinearClassifier object.
41 my_optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)
---> 42 my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
43 classifier = tf.estimator.LinearClassifier(
44 feature_columns=construct_feature_columns(),
/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/lazy_loader.pyc in __getattr__(self, item)
51
52 def __getattr__(self, item):
---> 53 module = self._load()
54 return getattr(module, item)
55
/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/lazy_loader.pyc in _load(self)
40 def _load(self):
41 # Import the target module and insert it into the parent's namespace
---> 42 module = importlib.import_module(self.__name__)
43 self._parent_module_globals[self._local_name] = module
44
/usr/lib/python2.7/importlib/__init__.pyc in import_module(name, package)
35 level += 1
36 name = _resolve_name(name[level:], package, level)
---> 37 __import__(name)
38 return sys.modules[name]
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/__init__.py in ()
34 from tensorflow.contrib import data
35 from tensorflow.contrib import deprecated
---> 36 from tensorflow.contrib import distribute
37 from tensorflow.contrib import distributions
38 from tensorflow.contrib import estimator
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/__init__.py in ()
20
21 # pylint: disable=unused-import,wildcard-import
---> 22 from tensorflow.contrib.distribute.python.cross_tower_ops import *
23 from tensorflow.contrib.distribute.python.mirrored_strategy import MirroredStrategy
24 from tensorflow.contrib.distribute.python.monitor import Monitor
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_ops.py in ()
21 import six
22
---> 23 from tensorflow.contrib.distribute.python import cross_tower_utils
24 from tensorflow.contrib.distribute.python import values as value_lib
25 from tensorflow.python.client import device_lib
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/distribute/python/cross_tower_utils.py in ()
21 import collections as pycoll
22
---> 23 from tensorflow.contrib import nccl
24 from tensorflow.python.framework import dtypes
25 from tensorflow.python.framework import ops
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/__init__.py in ()
28 from __future__ import print_function
29
---> 30 from tensorflow.contrib.nccl.python.ops.nccl_ops import all_max
31 from tensorflow.contrib.nccl.python.ops.nccl_ops import all_min
32 from tensorflow.contrib.nccl.python.ops.nccl_ops import all_prod
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/nccl/python/ops/nccl_ops.py in ()
28
29 _nccl_ops_so = loader.load_op_library(
---> 30 resource_loader.get_path_to_datafile('_nccl_ops.so'))
31
32
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/util/loader.pyc in load_op_library(path)
54 return None
55 path = resource_loader.get_path_to_datafile(path)
---> 56 ret = load_library.load_op_library(path)
57 assert ret, 'Could not load %s' % path
58 return ret
/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.pyc in load_op_library(library_filename)
54 RuntimeError: when unable to load the library or get the python wrappers.
55 """
---> 56 lib_handle = py_tf.TF_LoadLibrary(library_filename)
57
58 op_list_str = py_tf.TF_GetOpList(lib_handle)
NotFoundError: libnccl.so.2: cannot open shared object file: No such file or directory
Managed to fix by following this:
https://serverfault.com/questions/201709/how-to-set-ld-library-path-in-ubuntu
used the method in the link to add nccl to ldconfig
This website is a great resource! Thank You!
hi Osanda,
glad to know that your issue was resolved. i was a little bit occupied by other things and couldn’t spend time looking at the new comments but you can stay tuned for upcoming posts.
I ran into an error – missing input file ‘@local_config_nccl//:nccl/NCCL-SLA.txt’ and solved it simply by following this : http://www.like2do.com/it-answers?id=52729163&ttl=NCCL+install+%2F+Tensorflow+complie+error+with+RTX+2080ti
System : Ubuntu 16.04, CUDA 9.2, CuDNN 7.1.3, NCCL 2.3, GTX1060
Hi Gerard, thanks for your update. This is a confirmed TensorFlow issue. Here is the issue page on GitHub, as a reference for others: https://github.com/tensorflow/tensorflow/issues/19679