I’ve never been so jubilant to see `custard apple (score = 0.00147)` in my terminal window. It meant that I had finally classified an image using TensorFlow on my brand new GPU. Despite my confidence as I sat down with the visually appealing official guide, I found the process to be time-consuming and frustrating. Based on the number and diversity of issues I saw others having as I Googled (actually DDGed) around, it looks like I’m not alone. As the beneficiary of their hard-won experience, I wanted to contribute some of the things that I learned in the process.
I’m going to experiment a bit with the structure, alternating between abstract and specific thoughts. The value of specific thoughts is intuitive, but worth illuminating: None of this article has any value if it doesn’t help you, the reader, do something differently. Not “change your viewpoint” or “deepen your understanding”, but literally tap a different sequence of keys on your keyboard than you would have otherwise. Directly saying “Type this, not that” is the shortest path to this goal, and shorter paths are less likely to be waylaid.
Unfortunately, as Barney the purple dinosaur tried to warn us, we’re all unique in our own way. This is mostly a good thing, but it can make it difficult to share advice. If nothing else, simply copying my `.bash_history` would start to fail as soon as you got to paths starting with `/home/mritter/`. You’re smart enough to trivially take that, abstract it up to “He means his home directory” and granularize it back to `/home/jsmith/` or whatever. You could do this, but there’s no reason I should make all of my readers perform that same first step, particularly for less obvious situations.
Specific: Ensure the right graphics driver is being used by blacklisting the default
Even after going through the installation steps, my Samsung Odyssey laptop wasn’t recognizing the existence of my GPU. The final step to fixing this was editing my `/etc/modprobe.d/blacklist-nouveau.conf` file to contain:
blacklist nouveau
options nouveau modeset=0
then running `sudo update-initramfs -u` and restarting. I could then confirm that the GPU was recognized with `lshw -C video`. I tried other things beforehand (the probably-relevant parts of which are detailed below), but I can’t know whether they were critical to this final bug or totally separate.
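Before rebuilding the initramfs and rebooting, it can be worth confirming that the blacklist file really contains both lines. The following is a minimal sketch, not something I ran at the time; `has_nouveau_blacklist` is a hypothetical helper name, and the file path is the one from my machine.

```bash
#!/usr/bin/env bash
# Sketch: report whether a modprobe config file blacklists the nouveau driver.
# has_nouveau_blacklist is a hypothetical helper, not part of any standard tool.
has_nouveau_blacklist() {
    local conf_file="$1"
    # Both lines must be present; surrounding whitespace is tolerated.
    # A missing or unreadable file makes grep fail, so the function returns non-zero.
    grep -Eq '^[[:space:]]*blacklist[[:space:]]+nouveau[[:space:]]*$' "$conf_file" 2>/dev/null &&
    grep -Eq '^[[:space:]]*options[[:space:]]+nouveau[[:space:]]+modeset=0[[:space:]]*$' "$conf_file" 2>/dev/null
}

# Example usage (path assumed from the article):
#   has_nouveau_blacklist /etc/modprobe.d/blacklist-nouveau.conf && echo "nouveau is blacklisted"
```

The check only looks at the file's contents. Whether the running kernel has actually dropped nouveau still has to be confirmed after the reboot, for example with `lshw -C video`.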
Abstract: The state space with positive outcomes was much smaller than I expected
Because the guides that I was reading were a few months old (which is years in internet time, and centuries in Deep Learning time), I assumed that I should just use the latest version of each suggested library or driver. This assumption has served me well for dozens of previous installation processes, but it failed this time. Perhaps I should have been more suspicious because of the unusual cross-corporate nature of the situation, or maybe you just win some and lose some. I won’t get into all of the other instances where a minor deviation from the advice caused cascading issues, but it was an important reminder that “extremely similar” configurations are not always good enough.
Specific: Be careful with CUDA 9.1!
The first major issue that I identified after trying to follow this comprehensive guide is that I’d installed CUDA 9.1 instead of 9.0. I assumed that since it wasn’t a major version number, it would be fully backwards compatible. To its credit, the official documentation mentions the correct version number, but some of the commands it suggests default to the more recent version of various libraries, which have presumably changed since it was published. This short video does a good job of outlining the small changes you need to make for it to work.
Note that you can get away with 9.1 if you build TensorFlow from source. But that sounded like opening up a shipping container of boxes of cans of worms, so I didn’t go down that route.
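A quick way to catch the 9.1-versus-9.0 mix-up early is to compare the installed CUDA release against the one you expect. The sketch below assumes the version text looks like what `nvcc --version` prints (a line containing `release 9.0`) or like the `CUDA Version 9.0.176` line in `/usr/local/cuda/version.txt`; the helper names `cuda_major_minor` and `cuda_version_ok` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: pull "major.minor" out of CUDA version text and compare it to an expected release.
# cuda_major_minor and cuda_version_ok are hypothetical helper names.
cuda_major_minor() {
    # Reads text on stdin and prints the first X.Y that follows "release " or "Version ".
    sed -nE 's/.*(release|Version) ([0-9]+\.[0-9]+).*/\2/p' | head -n 1
}

cuda_version_ok() {
    # Succeeds only if the version text on stdin matches the expected release, e.g. 9.0.
    local expected="$1"
    [ "$(cuda_major_minor)" = "$expected" ]
}

# Example usage (assumes nvcc is on your PATH):
#   nvcc --version | cuda_version_ok 9.0 || echo "CUDA release is not 9.0"
```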
General: This stuff is still bleeding edge
I’ve always had a romantic notion of what it would have been like to work with steam engines during the Victorian age, or airplanes when they were new. New records being set every day! Limitless opportunity! …And frustrating setbacks caused by obscure parts!
The Wright Brothers, for example, had attempted a flight before the one which went down in history. It took two whole days to repair the ‘minor’ damage that the machine suffered, so that they could make their successful attempt. Their inspiration, a world famous glider pilot named Lilienthal, had (over the course of his 5 years in the spotlight) spent just 5 hours in the air. About half a workday actually doing the thing he was world famous for, the rest of the time handling logistics.
Good user experience fades into the background, and it’s easy to forget how hard and complex things are. When you’re at the bleeding edge, there’s nobody in charge of making your experience pleasant, or of guaranteeing that what you want to do is even possible. When you’re lucky enough to find a guide, it usually assumes that you have considerable experience, which will let you fill in the gaps. For example, when was the last time someone digressed from their Stack Overflow answer to clarify “sudo means that you have to type your admin password”? That’s just a common denominator on that website, as are hundreds of other little bits of knowledge. Somehow our computing culture has come together on some de facto curriculum that lets most people understand each other, most of the time. But on the bleeding edge, when you’re talking about graphics drivers and rapidly updating libraries, those gaps can become impossible to bridge.
Specific: These commands are your friend
sudo dpkg --purge <package> # Completely remove an installed system package, including drivers
sudo apt-get purge <package> # The same removal, via apt-get
apt list --installed | grep <package> # Search through installed packages (make sure they're all the right version!)
dpkg -l | grep "cuda" # The same search via dpkg, here for CUDA packages
lshw -C video # See whether the GPU is visible to the machine
lsmod | grep nvidia # See list of relevant drivers (Make sure none are of the wrong version)
cat /proc/driver/nvidia/version # See Driver information
/usr/lib/nvidia-384/bin/nvidia-smi # See GPU details
The hardest part of the project was not doing things, but UNdoing them. Followed closely by knowing whether I had to undo them in the first place.
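To make the "are they all the right version?" check less of an eyeball exercise, the `dpkg -l` output can be filtered for packages that don't match. This is a rough sketch under one assumption: that the version strings of NVIDIA's CUDA packages begin with the CUDA release number (for example `9.0.176-1`). The helper name `mismatched_packages` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: from `dpkg -l` output on stdin, print installed packages whose name matches
# a pattern but whose version does not start with the expected prefix.
# mismatched_packages is a hypothetical helper name.
mismatched_packages() {
    local name_pattern="$1" expected_prefix="$2"
    awk -v pat="$name_pattern" -v want="$expected_prefix" '
        # "ii" in the first column marks a fully installed package;
        # index(...) == 1 means the version string starts with the expected prefix.
        $1 == "ii" && $2 ~ pat && index($3, want) != 1 { print $2, $3 }
    '
}

# Example usage:
#   dpkg -l | mismatched_packages cuda 9.0
```

An empty result means nothing looked out of place. Anything printed is a candidate for the purge commands above.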
General: Learn to quickly Create, Read, Update and Delete in the system you’re debugging
Because I was largely operating in a space that I’m unfamiliar with, I didn’t know how to verify that I was on track until the end of the installation process. That would not have been as bad if the errors I got there had been more specific, but I was left with a diagnosis that boiled down to “One (or more!) of the 10 steps that you took is interacting with one (or more!) of your unknowable number of system configurations incorrectly.” That, combined with my lack of fluency with the basic CRUD operations around drivers, made debugging by elimination extremely slow.
Working through it with a friend who had both this background and a running system to validate against was critical for getting mine set up. THANK YOU STAN!!!
Specific: CUDNN doesn’t usually seem to be the root of issues, and CUDA versions often are
The download page is mercifully specific about which CUDA version each CUDNN option requires. I didn’t have to re-install it after moving a bunch of other things around. It really is just a few files, which you can see with the `ls /usr/local/cuda-9.0/lib64/libcudnn*` command.
Make sure that you’ve got the directory for the right version (denoted by the 3-digit number) on your PATH, and its parent directory on your LD_LIBRARY_PATH. In my case that directory was `/usr/lib/nvidia-384/bin`.
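The PATH and LD_LIBRARY_PATH check can be sketched as a small helper that tests whether a directory is an exact entry in a colon-separated list. `path_contains` is a hypothetical name, and the directories in the usage lines are the ones from my machine, so yours will likely differ.

```bash
#!/usr/bin/env bash
# Sketch: check whether a directory is an exact entry in a colon-separated search path.
# path_contains is a hypothetical helper name.
path_contains() {
    local dir="$1" search_path="$2"
    # Wrapping both sides in colons avoids partial matches such as
    # /usr/lib/nvidia-384 matching /usr/lib/nvidia-384/bin.
    case ":${search_path}:" in
        *":${dir}:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage (directories assumed from the article):
#   path_contains /usr/lib/nvidia-384/bin "$PATH" || echo "bin directory missing from PATH"
#   path_contains /usr/lib/nvidia-384 "$LD_LIBRARY_PATH" || echo "parent directory missing from LD_LIBRARY_PATH"
```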
That’s what I’ve got for you. If you’re trying to get TensorFlow set up, I wish you the best of luck. It’s definitely possible (as long as you actually have a GPU!), and actually doesn’t take too long if you’re lucky enough to find a guide that aligns with your needs perfectly. If you run into issues, I definitely recommend finding someone who’s been through it recently. In this and so many things, there’s a lot to be said for good friends! (I’ll take this opportunity to thank Stan again. I couldn’t have done it without the 150+ chat messages that we shared while debugging everything.)