Files

In Unix, everything is a file; see page 27 of Unix Power Tools.

Line Endings

When manipulating files in Unix, remember that Unix text files have a newline \n at the end of each line. Windows files, in contrast, have a carriage return \r followed immediately by a newline \n. Macintosh files historically (more than 20 years ago) used only a carriage return \r, but the Mac operating system has been Unix based since 2001, so Macs now use a newline \n just like Unix. Sometimes it is necessary to convert files from one type to another, and in particular to change the characters at the end of each line of text.

For example, to remove the carriage return from a Windows file, so that it can be used on a Mac or Unix machine, you can use:

tr -d '\015' <mypcfile.txt >myunixfile.txt

(See the discussion at the end of Section 21.11 in Unix Power Tools. Note that the book was last updated in 2003 and still has outdated information about Mac files: since 2001 the Mac operating system has been Unix based, so both Mac and Unix use a newline character \n at the end of each line of text.)
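The conversion above can be checked with a small experiment. This is only a sketch: it builds its own two-line test file in a temporary directory (created with mktemp -d, an assumption about the system), reuses the file names from the example, and counts the lines that contain a carriage return before and after running tr.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Create a small file with Windows-style (carriage return + newline) endings.
printf 'first line\r\nsecond line\r\n' > "$workdir/mypcfile.txt"

# Store a literal carriage return so grep can search for it.
cr=$(printf '\r')

# Count the lines that contain a carriage return before the conversion.
before=$(grep -c "$cr" "$workdir/mypcfile.txt")

# Delete every carriage return (octal 015), leaving only the newlines.
tr -d '\015' < "$workdir/mypcfile.txt" > "$workdir/myunixfile.txt"

# Count again; grep -c prints 0 (and returns a nonzero status) when nothing matches.
after=$(grep -c "$cr" "$workdir/myunixfile.txt" || true)

echo "lines with a carriage return: before=$before after=$after"
```

If the converted file still reported carriage returns, the original file would not have been a plain Windows text file, and it would be worth inspecting it more closely before using it.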

Naming files in Unix

It is advisable to use only letters, numbers, underscores, and periods in file names in Unix.

Please try to avoid putting spaces in Unix filenames. If you use a space in a filename, it leads to difficulties, for instance, needing to put double quotes around a file name, or needing to put a backslash to escape each space: "this looks ugly.txt" or my\ silly\ file\ name.txt

Filenames are also case sensitive. In other words, thisismyfile.txt and ThisIsMyFile.txt are different files. You will get comfortable with file naming conventions and will develop your own style. Dr Ward likes to use (mostly) lowercase letters in his file names, for instance, myexample.txt. Some people like to use camel case (sometimes called camelback) notation, in which the first word in a file name is not capitalized, but the other words are capitalized, for instance, thisIsMyFile.txt.
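To see why spaces cause trouble, here is a short sketch that creates a few files in a temporary directory (made with mktemp -d, an assumption about the system) and counts how many arguments the shell sees with and without quotes. The name my_example_2024.txt is made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# A name with spaces must be quoted, or each space escaped with a backslash.
touch "$workdir/this looks ugly.txt"
touch "$workdir"/my\ silly\ file\ name.txt

# A name built from letters, digits, underscores, and periods needs no quoting.
touch "$workdir"/my_example_2024.txt

# Without quotes, the shell splits the name at each space into separate arguments.
set -- this looks ugly.txt
unquoted_count=$#

# With quotes, the whole name stays together as one argument.
set -- "this looks ugly.txt"
quoted_count=$#

echo "unquoted: $unquoted_count arguments, quoted: $quoted_count argument"
ls "$workdir"
```

A command such as rm given the unquoted name would therefore try to act on three separate files, none of which is the file that was intended.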

File Extensions

Unix usually does not treat a file differently based on its file extension. Nonetheless, especially when working with data, it can be helpful to name files in such a way that a user can guess what kind of data is inside the file.

For instance:

  • .csv for comma-separated text files (i.e., with commas between text fields)

  • .tsv for tab-separated text files (i.e., with tabs between text fields)

  • .txt for text files with other separators or without any separators

  • .db for database files

  • .sh for bash shell scripts

  • .gz, .tar, .tar.gz, .tgz, .z, .Z, .zip for compressed or archived files

  • .html, .htm, .xhtml, .xml for HTML or XML files

  • .ps for PostScript files

  • .pdf for Adobe PDF files

  • .c or .cpp for C or C++ source files, respectively

  • a.out is the default name for executable files created by C and C++ compilers

  • file extensions are usually not used with directory names (in other words, it looks weird to use a period in a directory name)
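As a small illustration that the extension is mainly a hint for people, the following sketch stores a one-line shell script in a file whose name ends in .txt and runs it with sh anyway. The file name hello.txt and the temporary directory (made with mktemp -d) are assumptions chosen for this example.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Save a one-line shell script under a name that ends in .txt rather than .sh.
printf 'echo "hello from a script"\n' > "$workdir/hello.txt"

# sh reads and runs the contents of the file; the extension plays no role.
output=$(sh "$workdir/hello.txt")
echo "$output"
```

Naming the file hello.sh instead would change nothing about how it runs, but it would tell a reader what to expect inside.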

Searching for files

When using ls or mv or grep or other shell commands to work with files, it is possible to identify multiple files using wildcards. A question mark ? matches any single character. For example:

rm drwardfile?.csv

would remove (for instance) the files drwardfile1.csv through drwardfile5.csv and also the files drwardfileA.csv and drwardfileB.csv and drwardfileC.csv etc.

An asterisk * matches any sequence of characters, including no characters at all. For instance, if we type

mv mystate*data.txt staterepository

we could move the files mystateINdata.txt and mystateILdata.txt and mystateOHdata.txt and mystateMIdata.txt etc. into the directory staterepository. Such a command would also move any files named, for instance, mystate1999data.txt or mystatedata.txt or mystateasdFGhJKl1234data.txt into the directory staterepository.

Similarly, rm *.csv would remove all files with the extension .csv at the end. The command rm * removes all files in the current directory, except hidden files whose names begin with a period.

It is also possible to match specific characters or ranges of characters by putting them inside square brackets. For instance, mv myfiles199[0-9].txt ninetiesfolder will move all files from myfiles1990.txt to myfiles1999.txt into the directory ninetiesfolder.

Recently Dr Ward wanted to remove all files ending with .csv and .CSV and even files ending in anything in between, for instance, .Csv, and he used rm *.[Cc][Ss][Vv] as one way to achieve the removal of all of these files at once.
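Because rm cannot be undone, it can help to preview a wildcard pattern with echo before using it. The sketch below is one way to do that. It works in a temporary directory (made with mktemp -d, an assumption about the system), and the extra file names, such as drwardfile10.csv, myfiles199x.txt, sales.CSV, budget.Csv, and notes.txt, are made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create some empty files to match against.
touch drwardfile1.csv drwardfileA.csv drwardfile10.csv
touch myfiles1995.txt myfiles199x.txt
touch sales.CSV budget.Csv notes.txt

# echo prints what a pattern expands to, without changing any files.
echo drwardfile?.csv      # ? matches exactly one character, so drwardfile10.csv is skipped
echo myfiles199[0-9].txt  # only a digit matches [0-9], so myfiles199x.txt is skipped
echo *.[Cc][Ss][Vv]       # any mix of upper and lower case in the extension

# Keep one expansion in a variable so it can be inspected later.
digit_matches=$(echo myfiles199[0-9].txt)

# Once the preview looks right, reuse the same pattern with rm.
rm -- *.[Cc][Ss][Vv]

# Only the .txt files are left.
ls
```

The -- after rm marks the end of the options, which protects against a file whose name happens to begin with a dash.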

==

Unix stores files in a hierarchical structure, with directories (also called folders) and files or more directories inside, possibly over and over. The Mac and Windows operating systems have similar structures of directories (or folders) and files.

See Figure 1-3, "A Unix filesystem tree," in Section 1.14 of Unix Power Tools

![](./images/httpatomoreillycomsourceoreillyimages142646.png)

and Figure 1-4, "Part of a Unix filesystem tree," in Section 1.16 of Unix Power Tools

![](./images/httpatomoreillycomsourceoreillyimages142648.png)

Here are some examples from Anvil:

/anvil/projects/tdm/data

is the directory for the data sets from The Data Mine projects, and

/home/x-mdw

is Dr Ward’s home directory.
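To get a feel for the hierarchy without touching a real system directory, the sketch below builds a small tree inside a temporary directory (made with mktemp -d, an assumption about the system), lists it with find, and splits one path into its directory part and its file part. The tree only imitates the shape of the Anvil path above, and the file name flights.csv is made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# mkdir -p creates the nested directories, including any missing parents.
mkdir -p "$workdir/projects/tdm/data"
touch "$workdir/projects/tdm/data/flights.csv"

# find walks the tree and prints every directory and file under the starting point.
find "$workdir/projects" | sort

# A full path is a directory part followed by a file name.
path="$workdir/projects/tdm/data/flights.csv"
dir_part=$(dirname "$path")
file_part=$(basename "$path")

echo "directory: $dir_part"
echo "file:      $file_part"
```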

Chapter 10 of Unix Power Tools has a great deal of information about how to establish links to files in Unix, including what happens when moving or copying files, and how links work behind the scenes in the operating system. We do not focus on links to files here, but if you are interested, there are a variety of examples in that chapter to consider.

Differences between files

The diff and cmp commands can be used to compare files. For instance:

diff myfile1.txt myfile2.txt

shows any differences between these two files. The lines with < at the beginning show the content of myfile1.txt that is missing from myfile2.txt, and similarly, the lines with > at the beginning show the content of myfile2.txt that is missing from myfile1.txt.

If the two files are identical, then diff will not print any output.

On the other hand,

cmp myfile1.txt myfile2.txt

compares the two files byte by byte but only shows the location of the first difference between them. For this reason, cmp is often faster than diff. As with the diff command, cmp will not print any output if the two files are identical.
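Both commands also report their result through the exit status, which is useful in scripts: 0 means the files are identical and 1 means they differ. The sketch below demonstrates this on three tiny files created in a temporary directory (made with mktemp -d, an assumption about the system). The file contents and the name myfile1copy.txt are made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)

# Two files that differ on the second line, plus an exact copy of the first.
printf 'apple\nbanana\n' > "$workdir/myfile1.txt"
printf 'apple\ncherry\n' > "$workdir/myfile2.txt"
cp "$workdir/myfile1.txt" "$workdir/myfile1copy.txt"

# diff prints the differing lines (< from the first file, > from the second).
diff_status=0
diff "$workdir/myfile1.txt" "$workdir/myfile2.txt" || diff_status=$?

# cmp -s prints nothing at all; only the exit status says whether the files match.
cmp_status=0
cmp -s "$workdir/myfile1.txt" "$workdir/myfile2.txt" || cmp_status=$?

# Comparing a file with its exact copy gives status 0.
same_status=0
cmp -s "$workdir/myfile1.txt" "$workdir/myfile1copy.txt" || same_status=$?

echo "diff status: $diff_status, cmp status: $cmp_status, identical copy status: $same_status"
```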

less

The less utility is helpful for reading a file when you only want to view it, not modify it.

quota

Unix systems usually have a way to easily see a user’s quota, i.e., how much space the user has available for saving files. On Anvil, users can check their quota by typing:

myquota

Right now, Dr Ward’s quota looks like this:

Type     Location           Size    Limit    Use        Files    Limit     Use
==============================================================================
home     x-mdw             1.9GB   25.0GB     8%            -        -      -
scratch  anvil             6.9GB  100.0TB  0.01%            0k   1,000k  0.00%
projects x-cis220051      10.1TB   15.0TB    67%        3,084k  10,485k    29%

Users can store files in their home directories, up to a maximum of 25 GB. They can also store files in their scratch directories, up to a maximum of 100 TB. For instance, Dr Ward’s home directory and scratch directory are located at:

/home/x-mdw

and

/anvil/scratch/x-mdw

Scratch directories are sometimes erased at regular intervals. Users should store files that they want to keep in their home directory, and should store temporary files in their scratch directory (for instance, large data files that can be removed after analyzing them).