Identifying the Biggest Files
We've explored the du command, sprinkled in a wee bit of sort for zest, and now it's time to accomplish a typical sysadmin task: Find the biggest files and directories in a given area of the system.
Task 3.4: Finding Big Files
The du command offers the capability to either find the largest directories, or the combination of the largest files and directories, but it doesn't offer a way to examine just files. Let's see what we can do to solve this problem.
-
First off, it should be clear that the following command will produce a list of the five largest directories in my home directory:
# du | sort -rn | head -5
28484 .
13984 ./Lynx
10464 ./IBM
6848 ./Lynx/src
3092 ./Gator
In a similar manner, you can list the five largest directories in /usr/share and in the overall file system (ignoring the likely /proc errors):
# du /usr/share | sort -rn | head -5
543584 /usr/share
200812 /usr/share/doc
53024 /usr/share/gnome
48028 /usr/share/gnome/help
31024 /usr/share/apps
# du / | sort -rn | head -5
1471213 /
1257652 /usr
543584 /usr/share
436648 /usr/lib
200812 /usr/share/doc
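The pattern in these commands is always the same, so it can be wrapped in a small shell function. This is only a sketch: the name biggest_dirs is made up for this example, and the -k flag is an addition of ours, assumed here so that sizes come out in 1KB units even on systems where du defaults to 512-byte blocks. The 2>/dev/null simply hides error messages such as the /proc ones mentioned above.

```shell
# biggest_dirs DIR [COUNT]: show the COUNT largest directories under DIR
# (default 5). Hypothetical helper name, not a standard command.
biggest_dirs() {
    dir=${1:-.}
    count=${2:-5}
    # -k asks du for 1KB units; error messages (for example, about
    # unreadable /proc entries) are discarded to keep the listing clean.
    du -k "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example: the five largest directories at or below the current one.
biggest_dirs . 5
```

Called as biggest_dirs /usr/share 5, it should produce the same kind of listing as the first command shown above.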
All well and good, but how do you find and test just the files?
-
The easiest solution is to use the find command. find will be covered in greater detail later in the book, but for now, just remember that find lets you quickly search through the entire file system, and performs the action you specify on all files that match your selection criteria.
For this task, we want to isolate our choices to all regular files, which will omit directories, device files, and other unusual file system entries. That's done with -type f.
In addition, we're going to use the -printf option of find to produce exactly the output that we want from the matched files. In this instance, we'd like the file size in kilobytes, and the name of each file along with its path. That's surprisingly easy to accomplish with a printf format string of %k %p.
Put all these together and you end up with the command
find . -type f -printf "%k %p\n"
The two additions here are the ., which tells find to start its search in the current directory, and the \n sequence in the format string, which is translated into a newline after each entry.
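To see how the pieces of that command fit together, here is a minimal sketch that runs it against a throwaway directory. The scratch directory and the file names (notes.txt and sub/data.bin) are invented for illustration, and it assumes GNU find, because -printf is a GNU extension.

```shell
# Build a small scratch tree: two regular files and one subdirectory.
scratch=$(mktemp -d)
mkdir "$scratch/sub"
echo "hello" > "$scratch/notes.txt"
echo "world" > "$scratch/sub/data.bin"

# -type f keeps regular files only, so the directory "sub" is never listed.
# %k is the size in 1KB blocks; %p is the path as find reached it,
# starting from the "." we handed it.
listing=$(cd "$scratch" && find . -type f -printf "%k %p\n")
echo "$listing"

# Clean up the scratch tree.
rm -rf "$scratch"
```

The exact sizes printed depend on how the file system allocates blocks, but there should be one line per regular file and none for the directory.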
TIP
Don't worry too much if this all seems like Greek to you right now. Hour 12, "Managing Disk Quotas," will talk about the many wonderful features of find. For now, just type in what you see here in the book.
-
Let's see it in action:
# find . -type f -printf "%k %p\n" | head
4 ./.kde/Autostart/Autorun.desktop
4 ./.kde/Autostart/.directory
4 ./.emacs
4 ./.bash_logout
4 ./.bash_profile
4 ./.bashrc
4 ./.gtkrc
4 ./.screenrc
4 ./.bash_history
4 ./badjoke
You can see where the sort command is going to prove helpful! In fact, let's put a sort -rn in front of head to identify the ten largest files in and below the current directory:
# find . -type f -printf "%k %p\n" | sort -rn | head
8488 ./IBM/j2sdk-1_3_0_02-solx86.tar
1812 ./Gator/Snapshots/MAILOUT.tar.Z
1208 ./IBM/fop.jar
1076 ./Lynx/src/lynx
1076 ./Lynx/lynx
628 ./Gator/Lists/Inactive-NonAOL-list.txt
496 ./Lynx/WWW/Library/Implementation/libwww.a
480 ./Gator/Lists/Active-NonAOL-list.txt
380 ./Lynx/src/GridText.c
372 ./Lynx/configure
Very interesting information to be able to ascertain, and it'll even work across the entire file system (though it might take a few minutes, and, as usual, you might see some /proc hiccups):
# find / -type f -printf "%k %p\n" | sort -rn | head
26700 /usr/lib/libc.a
19240 /var/log/cron
14233 /var/lib/rpm/Packages
13496 /usr/lib/netscape/netscape-communicator
12611 /tmp/partypages.tar
9124 /usr/lib/librpmdb.a
8488 /home/taylor/IBM/j2sdk-1_3_0_02-solx86.tar
5660 /lib/i686/libc-2.2.4.so
5608 /usr/lib/qt-2.3.1/lib/libqt-mt.so.2.3.1
5588 /usr/lib/qt-2.3.1/lib/libqt.so.2.3.1
Recall that the output is in 1KB blocks, so libc.a is pretty huge at more than 26MB!
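Since the sizes are reported in 1KB blocks, a little awk can do the conversion to megabytes for you. The following is only a sketch: the function name biggest_files and its arguments are invented for this example, and it still relies on GNU find's -printf.

```shell
# biggest_files DIR [COUNT]: list the COUNT largest regular files under DIR
# (default 10), showing each size in megabytes instead of 1KB blocks.
# Hypothetical helper, not a standard command; needs GNU find for -printf.
biggest_files() {
    dir=${1:-.}
    count=${2:-10}
    find "$dir" -type f -printf "%k %p\n" 2>/dev/null |
        sort -rn |
        head -n "$count" |
        # Save the size, strip it from the line, then print size/1024
        # followed by the untouched path (spaces in names survive).
        awk '{ kb = $1; sub(/^[0-9]+ /, ""); printf "%.1fMB %s\n", kb / 1024, $0 }'
}

# Example: the five largest files at or below the current directory.
biggest_files . 5
```

With this conversion, the 26700 blocks reported for libc.a above would be shown as 26.1MB.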
-
You might find that your version of find doesn't include the snazzy new GNU -printf flag (neither Solaris nor Darwin does, for example). If that's the case, you can at least fake it in Darwin with the somewhat more convoluted
# find . -type f -print0 | xargs -0 ls -s | sort -rn | head
781112 ./Documents/Microsoft User Data/Office X Identities/Main Identity/Database
27712 ./Library/Preferences/Explorer/Download Cache
20824 ./.Trash/palmdesktop40maceng.sit
20568 ./Library/Preferences/America Online/Browser Cache/IE Cache.waf
20504 ./Library/Caches/MS Internet Cache/IE Cache.waf
20496 ./Library/Preferences/America Online/Browser Cache/IE Control Cache.waf
20496 ./Library/Caches/MS Internet Cache/IE Control Cache.waf
20488 ./Library/Preferences/America Online/Browser Cache/cache.waf
20488 ./Library/Caches/MS Internet Cache/cache.waf
18952 ./.Trash/Palm Desktop Installer/Contents/MacOSClassic/Installer
Here we not only have to print the filenames and feed them to the xargs command, we also have to compensate for the fact that many of the filenames have spaces in them. By default, xargs splits its input on whitespace, so a name like Download Cache would be handed to ls as two separate, nonexistent files. To avoid this, find has a -print0 option that terminates each filename with a null character, and the -0 flag tells xargs to expect null-terminated filenames.
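To make the whitespace problem concrete, this sketch creates one file whose name contains spaces and passes it through xargs both ways. The scratch directory and the name "Browser Cache File" are invented for the example. Neither -print0 nor -0 is guaranteed on every Unix, though the GNU and BSD (Darwin) versions of find and xargs both provide them.

```shell
# One scratch file whose name contains spaces.
scratch=$(mktemp -d)
echo "data" > "$scratch/Browser Cache File"

# Plain -print: xargs splits on whitespace, so ls is handed three broken
# words and complains that none of them exists. The "|| true" keeps the
# script going despite that expected failure.
plain=$(find "$scratch" -type f -print | xargs ls -s 2>&1) || true

# -print0 with xargs -0: names are separated by the null character only,
# so the whole name reaches ls in one piece.
safe=$(find "$scratch" -type f -print0 | xargs -0 ls -s)

echo "without -print0: $plain"
echo "with -print0:    $safe"

# Clean up the scratch tree.
rm -rf "$scratch"
```

The first result should contain only error messages from ls, while the second shows the block count followed by the complete filename.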
CAUTION
Actually, Darwin doesn't really like this kind of command at all. If you want to ascertain the largest files, you'd be better served to explore the -ls option to find and then an awk to chop out the file size:
find /home -type f -ls | awk '{ print $7" "$11 }'
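Of course, this is a slower alternative, but it'll work on any Unix system if you really want. Keep in mind that the size in field 7 is in bytes rather than kilobytes, and that a filename containing spaces will be cut off at the first space.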
-
To calculate the sizes of all files on a Solaris system, you can't use -printf or -print0, but if you set aside the concern about filenames with spaces in them (considerably less likely in a more traditional Unix environment like Solaris anyway), you'll find that the following works fine:
# find / -type f -print | xargs ls -s | sort -rn | head
55528 /proc/929/as
26896 /proc/809/as
26832 /usr/j2se/jre/lib/rt.jar
21888 /usr/dt/appconfig/netscape/.netscape.bin
21488 /usr/java1.2/jre/lib/rt.jar
20736 /usr/openwin/lib/locale/zh_TW.BIG5/X11/fonts/TT/ming.ttf
18064 /usr/java1.1/lib/classes.zip
16880 /usr/sadm/lib/wbem/store
16112 /opt/answerbooks/english/solaris_8/SUNWaman/books/REFMAN3B/index/index.dat
15832 /proc/256/as
Actually, you can see that the address space files for a couple of running processes have snuck into the listing (the as entries in the /proc directory). We'll need to screen those out with a simple grep -v:
# find / -type f -print | xargs ls -s | sort -rn | grep -v '/proc' | head
26832 /usr/j2se/jre/lib/rt.jar
21888 /usr/dt/appconfig/netscape/.netscape.bin
21488 /usr/java1.2/jre/lib/rt.jar
20736 /usr/openwin/lib/locale/zh_TW.BIG5/X11/fonts/TT/ming.ttf
18064 /usr/java1.1/lib/classes.zip
16880 /usr/sadm/lib/wbem/store
16112 /opt/answerbooks/english/solaris_8/SUNWaman/books/REFMAN3B/index/index.dat
12496 /usr/openwin/lib/llib-lX11.ln
12160 /opt/answerbooks/english/solaris_8/SUNWaman/books/REFMAN3B/ebt/REFMAN3B.edr
9888 /usr/j2se/src.jar
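An alternative to filtering afterward with grep is to tell find not to descend into /proc in the first place. That also avoids a side effect of grep -v '/proc', which discards any other path that happens to contain that string (a /usr/bin/procmail, say). The sketch below tries the idea on a scratch tree with a stand-in proc directory rather than the real root. It assumes a find that supports -path, which the GNU and BSD versions do; an older Solaris find may not, in which case the grep -v approach remains the fallback.

```shell
# Scratch tree standing in for /: one ordinary file, one file under "proc".
root=$(mktemp -d)
mkdir -p "$root/proc/929" "$root/usr/lib"
echo "process image" > "$root/proc/929/as"
echo "library" > "$root/usr/lib/libc.a"

# -path ... -prune stops find from descending into the proc directory;
# -o (or) applies "-type f -print" to everything else.
kept=$(find "$root" -path "$root/proc" -prune -o -type f -print)
echo "$kept"

# On a real system the same idea would read (not run here):
#   find / -path /proc -prune -o -type f -print | xargs ls -s | sort -rn | head

# Clean up the scratch tree.
rm -rf "$root"
```

Only the file under usr/lib should be printed. Skipping /proc up front also spares find the work of walking it at all.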
The find command is somewhat like a Swiss army knife. It can do hundreds of different tasks in the world of Unix. For our use here, however, it's perfect for analyzing disk usage on a per-file basis.