Deduplicating NAS Locally

28 Jan 2014 | CPOL | 4 min read

Introduction

I use a fairly cheap NAS (Network Attached Storage) called My Book Live to store files that I need to access from several computers. Only recently did I learn how to enable SSH access to the device, and it turned out to be a PowerPC-based embedded board running Linux:

MyBookLive:~# uname -a
Linux MyBookLive 2.6.32.11-svn48181 #1 Thu Sep 15 18:22:06 PDT 2011 ppc GNU/Linux
MyBookLive:~# cat /proc/cpuinfo
processor       : 0
cpu             : APM82181
clock           : 800.000008MHz
revision        : 28.130 (pvr 12c4 1c82)
bogomips        : 1600.00
timebase        : 800000008
platform        : PowerPC 44x Platform
model           : amcc,apollo3g
Memory          : 256 MB 

Enabling SSH allowed me to mount the device via sshfs, which was really helpful (NFS is a pain to configure if the NAS is behind a router, FTP does not preserve timestamps, and SMB has a bunch of problems of its own), so I'd like to thank Western Digital for being such a geek-friendly company.
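For reference, mounting over sshfs might look like the sketch below. The login `root@mybooklive` and the remote path `/shares` are assumptions for illustration, not values taken from my setup notes; `-o reconnect` asks sshfs to re-establish the connection if it drops.

```shell
# Hypothetical values: adjust the login, host name and remote path to your setup.
NAS_HOST="${NAS_HOST:-root@mybooklive}"   # SSH login on the NAS (assumed)
NAS_PATH="${NAS_PATH:-/shares}"           # remote directory to mount (assumed)

# Mount the NAS directory at the given local mount point over SSH.
mount_nas() {
    mnt="$1"
    mkdir -p "$mnt" &&
    sshfs "$NAS_HOST:$NAS_PATH" "$mnt" -o reconnect
}

# Detach the mount again (Linux FUSE).
unmount_nas() {
    fusermount -u "$1"
}
```

With these defined, `mount_nas ~/nas` attaches the share and `unmount_nas ~/nas` releases it.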

This also meant that I could perform certain operations on files locally, by running software directly on the NAS, without transferring data over the network. One such (time-consuming) operation is deduplication: finding files that are exact copies of one another (possibly under different names). One good tool for that is fdupes, which uses checksums to identify duplicate files.
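To make the checksum idea concrete, here is a minimal sketch that groups files by MD5 using GNU coreutils. It is not how fdupes is implemented: as far as I know, fdupes first compares file sizes, then MD5 signatures, and finally the bytes themselves, whereas this sketch hashes every file and trusts the checksum. The function name `list_duplicates` is made up for the example.

```shell
# List groups of files under a directory that share the same MD5 checksum.
# Requires GNU coreutils (md5sum, uniq -w / --all-repeated).
list_duplicates() {
    find "$1" -type f -exec md5sum {} + |   # one "<32 hex chars>  <path>" line per file
        sort |                              # identical checksums become adjacent
        uniq -w32 --all-repeated=separate   # keep only repeated checksums, one group per block
}
```

Running `list_duplicates /some/dir` prints each set of identical files as a block of lines, with blank lines between blocks. File names containing newlines or backslashes are printed by md5sum in an escaped form, so treat the output as a report rather than as input for scripts.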

Even though the NAS comes with a large collection of Unix software pre-installed, fdupes is missing from the list. A compiler is also missing, of course (who would ship a compiler on an embedded device?), but...

Building GCC is easy!

Well, at least if you have a POSIX environment and a case-sensitive filesystem (which also includes Windows-based Cygwin, provided that case sensitivity for NTFS is turned on).

There's a nice tool called crosstool-ng which requires little to no setup if one wants to build a GNU toolchain (gcc + friends). The tool is primarily oriented at building cross-compilers, i.e. compilers targeting some other CPU/platform (ARM, PowerPC, MIPS, you name it) than the system they run on, but crosstool-ng can build for x86 targets, too. If you are going with one of the supplied configurations, then there may be no setup needed at all - and that actually was the case for the NAS-targeted gcc.

Having built crosstool-ng itself (that's a typical configure && make && make install process not worthy of attention), I looked through the list of supplied toolchain configurations (ct-ng list-samples) and found one that was close enough to the target platform: powerpc-405-linux-gnu.

The important part when building a toolchain with crosstool-ng is: don't change the paths it wants to use. Or if you do (by default the tool litters the home directory with sources and output), make sure they are absolute, not relative - or you get a bunch of weird errors late in the process. Also, it helps to configure the toolchain before building :) (ct-ng menuconfig) and remove Java, Fortran and all debugging facilities (gdb, ltrace/strace, DUMA etc.), which can be problematic (and slow) to build.

So after selecting the PowerPC 405-oriented toolchain (PowerPC 440 seems to be backward compatible) and waiting a few (or was it several?) tens of minutes for ct-ng build, a shiny new gcc with binutils and whatnot was ready for action.
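Put together, the steps above come down to a handful of ct-ng invocations. The sketch below wraps them in a function whose name, `build_toolchain`, is made up for the example; the ct-ng subcommands themselves are the ones mentioned in the text. Run it from an empty work directory, since ct-ng keeps its .config there.

```shell
# Build a cross toolchain from a crosstool-ng sample configuration.
# Use `ct-ng list-samples` beforehand to see which samples are available.
build_toolchain() {
    sample="${1:-powerpc-405-linux-gnu}"
    ct-ng "$sample" &&      # load the sample as the current configuration
    ct-ng menuconfig &&     # interactive: drop Java, Fortran, gdb, ltrace/strace, ...
    ct-ng build             # download, configure and compile the whole toolchain
}
```

If the default paths are left alone, the finished toolchain should end up under ~/x-tools/powerpc-405-linux-gnu, with the compiler available as bin/powerpc-405-linux-gnu-gcc.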

Running on the device

Here comes the easy part: checking out the fdupes sources, compiling them for PowerPC with the cross gcc (a single-line change in fdupes' Makefile), copying the binary to the device and running it from there... it all worked as is, even though I didn't invest time in figuring out whether the kernel on the device was compatible with the kernel headers that the newly built GNU toolchain was using.
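A rough sketch of that build-and-copy step follows. Instead of editing the Makefile, it passes CC on the make command line, which overrides the Makefile's own value; this assumes the fdupes Makefile compiles via $(CC). The toolchain directory and the NAS login are assumptions as well, and the function names are made up for the example.

```shell
# Hypothetical locations: adjust to where your toolchain and NAS actually are.
TOOLCHAIN_BIN="${TOOLCHAIN_BIN:-$HOME/x-tools/powerpc-405-linux-gnu/bin}"
NAS_HOST="${NAS_HOST:-root@mybooklive}"

# Run inside the fdupes source directory.
cross_build_fdupes() {
    # A CC given on the command line takes precedence over the Makefile's setting.
    make CC="$TOOLCHAIN_BIN/powerpc-405-linux-gnu-gcc" &&
    file fdupes    # the description should mention PowerPC, not x86
}

# Copy the freshly built binary to the home directory of the NAS user.
deploy_fdupes() {
    scp fdupes "$NAS_HOST:"
}
```

After `cross_build_fdupes && deploy_fdupes`, something like `ssh root@mybooklive './fdupes -r /shares'` would start a recursive scan on the device itself (again with the assumed login and path).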

Now, regarding the anticipated question of whether it was worth it. The short answer is: probably yes. iostat shows throughput well above what 100BASE-T Ethernet (100 Mbit/s) is capable of, and on Gigabit Ethernet one would need to devote a significant part of the bandwidth to the device (and, naturally, the computer) for the whole duration of this lengthy operation. Also, I can now turn the computer off and leave the NAS to "sort things out" by itself.
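A back-of-the-envelope calculation shows what the network would cost. The sketch below computes a lower bound on the transfer time, ignoring protocol overhead; the 1 TB figure is an assumed round number for illustration, not a measurement, and `transfer_minutes` is a made-up helper.

```shell
# Minimum time, in whole minutes, to move data over a link at full line rate.
# Arguments: size in GB (10^9 bytes), link speed in Mbit/s.
transfer_minutes() {
    gigabytes="$1"
    mbit_per_s="$2"
    # GB * 8 * 1000 = megabits; divide by the link rate for seconds, by 60 for minutes
    echo $(( gigabytes * 8 * 1000 / mbit_per_s / 60 ))
}

transfer_minutes 1000 100     # 1 TB over 100 Mbit/s Ethernet: prints 1333
transfer_minutes 1000 1000    # 1 TB over Gigabit Ethernet: prints 133
```

That is roughly 22 hours on a 100 Mbit/s link and a bit over 2 hours on a fully dedicated gigabit link, just to read every file once.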

The catch here is that fdupes, in particular, requires a lot of memory for terabyte-sized data (about a gigabyte of RAM). The NAS has only 256 MB of RAM, so the rest goes to swap, which probably slows the process down. I haven't measured local deduplication against the remote kind, so I can't say for sure.

However, building software to run on what appeared to be an external disk drive is fun anyway. I already had a phone that I can SSH to, a router that I can install software on, and a PlayStation that runs Linux (fully legal, never updated past 3.21 system software); now, somewhat accidentally, I also have a network disk drive that runs Linux. I'm a happy geek :)

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)