Files in Unix

=============================================================================
Overview:  In this lesson, we will begin studying files, which store long term
           data on disks.  We talk about how files are named, how they are 
           created and destroyed and viewed, showing some simple commands.
=============================================================================

Section                            Topics
-------                            ------

   What is a file?
   File Names
   Contents of files
   Viewing Contents of Files
   Changing Files
   Some basic file commands
   More about File Types
   Creating files
   Deleting and Renaming Files


What is a file?
---------------

     Many people have an intuitive concept of a computer file from work on
Macintoshes, IBMs or other micros.  Just to review, a file is any collection
of data that is stored in a permanent area in a computer system.  Thus, a file
may be on a disk (hard or floppy) or on tape or on other long-term storage
devices.

     Files model real-world entities like documents, forms, reports, memos,
pictures, sound clips, video clips, and almost any other way of storing some
information together.  In the "old days" (twenty years ago or more), files were
datasets, which were huge lists of numbers representing measurements on 
something.

     Every operating system has some way of storing and accessing files.  Some
operating systems distinguish one type of file from another.  For example a
database file is stored in a different way than an executable program.

     In order to be as simple and general as possible, UNIX makes no assumption
about what is in a file, and so it says that any sequence of bytes is a file.  
The exact type of file that it is doesn't matter to UNIX, although the program 
that reads and interprets the file expects it to be a certain type.

     Files may be stored on long-term storage devices like hard or floppy
disks, but they are organized in logical clusters called directories.  (This
is the same concept as directories in MS-DOS or folders in the Macintosh.)

     Every user has a directory called their home directory.  When you log in,
you are placed in your home directory.  You can find out what this is by typing
the command "pwd":

     % pwd

UNIX responds with something like the following (if your username is "jim"):

     /usr1/mnt/dept/jim

More about directories is given in the "directories" tutorial.


File Names
----------

     Files have names so that we can tell UNIX to do things with them.  These
names can be quite long in UNIX (up to 255 bytes on most current systems), and
can even contain special characters (with a few exceptions).  UNIX is, in
general, case sensitive, so a file name "xyz" is NOT the same as "Xyz".

     Contrary to what you might have learned in other places, you can have
numbers as file names, or you can use dashes or pluses, or asterisks or even
blanks.  Blanks, however, present a slight problem, since the shell would
otherwise treat the blank as the end of the file's name.  Thus, if you have a
blank in your file name, you must enclose the whole name in quotes, such as
"my file".  The same goes for characters like the asterisk, which have a
special meaning to the shell.

     The slash is the one character you can type that is forbidden in file
names, since it is used to separate directory names.  (The null byte, which
has the value zero, is the only other forbidden byte.)
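
     The naming rules above can be tried out safely with a short sketch like
the one below.  It is only an illustration: it assumes a Bourne-style shell
(such as sh) rather than the shell behind the "%" prompt, it uses "mktemp" to
make a scratch directory, and it uses "touch" (a command not covered in this
lesson) to create empty files.  All of the file names are made up.

```shell
# Work in a scratch directory so none of your real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Case matters: on a case-sensitive file system these are two different files.
touch xyz Xyz

# A name containing a blank must be quoted...
touch "my file"

# ...or each blank must be escaped with a backslash.
touch my\ other\ file

# List what was created.
ls
```

Without the quotes, the command "touch my file" would create two files, one
named "my" and one named "file".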


Contents of files
-----------------

     Though any given file is, to UNIX, just an undifferentiated sequence of
bytes, a file is usually meant to be used with a particular program or command.
We sometimes call this interpretation of the file's contents its type.  For
example, a file that contained a spreadsheet would have different data than
one that contained a video image.

     In UNIX, unlike MS-DOS, the file's name does not need to indicate what
its type is.  For example, a file that contained a C program usually ends in
".c" such as program1.c.  However, there is no rule that says it MUST have
this kind of name.  This is different from other systems, like MS-DOS, where
the file type is part of the file's name.

     The most generic type of file is just ASCII text.  ASCII is a character
code that assigns a number to each character, and each character is stored in
one 8-bit chunk of computer memory (called a byte).  We say that a file that
contains only printable characters stores ASCII text.  This might be a letter
or a memo to somebody, a document, or a dataset of numbers.

     Other types of files will be discussed as they arise.  Not all of them
use just printable characters, so some care must be used when printing or 
viewing such files.


Viewing Contents of Files
-------------------------

     To see what is in an ordinary text file, a purr-fect command is "cat".
Just follow the cat command with the file's name:

     % cat syllabus

This command just reads the file line by line and prints it out to your
terminal, scrolling it if the file is longer than one screen's worth.

     Sometimes this goes by so fast that you can't see it all before it is gone.
A better command to view a file in a controlled way is "less", which shows you
only as many lines as your screen can display at one time.  Then it gives you 
a prompt, which is a colon.  To continue, press the space bar.  Or you can type
q to quit.   Other options include going backwards or advancing the file
one line at a time, rather than one screenful.  As usual, there is help for the
less command.  To see it, just type the letter h.

     % less syllabus

Another similar command is "more" which works more or less like "less".

     If you want to see the end of a file without waiting for the entire file
to scroll past you using cat, the tail command pulls the cat's tail, and shows
you the "rear end" of the file:

     % tail bigfile

There are options which allow you to control how big a tail section you see.
An analogous command, "head", shows you just the first part of a file.
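
     As a rough sketch of those options, the example below builds a sample
file and then looks at both ends of it.  It assumes a Bourne-style shell and
the "-n" option, which tells head and tail how many lines to show.  The file
"bigfile" here is just the numbers 1 through 100, generated with awk for the
sake of the demonstration.

```shell
# Work in a scratch directory so none of your real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Build a 100-line sample file: the numbers 1 through 100, one per line.
awk 'BEGIN { for (i = 1; i <= 100; i++) print i }' > bigfile

# Show only the first 3 lines (1, 2 and 3).
first3=$(head -n 3 bigfile)
echo "$first3"

# Show only the last 2 lines (99 and 100).
last2=$(tail -n 2 bigfile)
echo "$last2"

# With no option at all, tail shows the last 10 lines.
tail bigfile
```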


Changing Files
--------------

     Naturally we often need to change the contents of a file.  Sometimes we
make up new files, and other times we change existing files: we edit our term
paper, fix program errors, or enter data into a dataset file.

     The process of changing the contents of a file is called "editing".  There
are many ways to edit a file, often using programs like "vi" or "emacs".  These
programs are called editors and they allow you to view the file on your terminal
screen and to make additions, corrections, or deletions, much as you would 
alter a document in a word processing system on a Macintosh.

     There are other commands that change files.  The "sed" and "awk" commands
are commonly used for this.  They are sometimes called stream editors, or
non-interactive editors, because they allow you to specify changes to files,
but only in a batch or group, not interactively as you sit and watch the
changes being made on your screen.
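
     To give a flavor of non-interactive editing, here is a minimal sketch
using sed.  It assumes a Bourne-style shell; the file "memo" and its contents
are invented for the example.  The sed subcommand s/old/new/ replaces the
first occurrence of "old" on each line with "new".

```shell
# Work in a scratch directory so none of your real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Make a small two-line file to practice on.
printf 'Dear Jim,\nThe meeting is on Monday.\n' > memo

# sed writes the edited text to the screen; the file "memo" is not changed.
sed 's/Monday/Tuesday/' memo

# To keep the result, redirect it into a new file.
sed 's/Monday/Tuesday/' memo > memo.new
cat memo.new
```

Notice that you never see the file while the change is being made.  You state
the change in advance and sed applies it to every line.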

     There are two main styles of interactive editors.  Older editors are line
editors because they display one line at a time and allow you to specify 
a subcommand to make a change.  You do not "move around" on the screen, however,
in order to specify where to make the change.  Newer editors are more like 
software that most people are familiar with on personal computers.  There is a 
cursor that you can move around and changes happen at the spot where the cursor
is.  They are sometimes called "full-screen editors" or "visual editors" (as if
one used the older line editors with eyes closed!)

     We will not even discuss the older line editors here, but rather just
name them; they are "ed" and "ex".

     A separate lesson discusses the popular "vi" editor, which is the most
common "full screen" editor in UNIX.  Another popular editor is "emacs", but
we do not use it at Canisius.


Some basic file commands
------------------------

     One of the most overworked UNIX commands doesn't even deal with files
directly.  It is the "ls" command, which lists the names of files in a
directory.  Since users want to see what files are on disk, they often use
"ls" to do this.

     % ls

This lists the files in the current directory.  You can also give the
name of a directory explicitly, if it is different than the current one.

     % ls /mnt1/dept/meyer

There are a ton of options on the ls command, some of which will be explained 
later.  You can read all about them by doing

     % man ls

     There are two ways to find out how big a file is.  The "wc" command,
which stands for "word count", counts lines, words and bytes.  A word is
defined to be any consecutive sequence of non-blank characters.  Here's an
example:

     % wc file1
     199 445 4827    file1

This says that there are 199 lines in this file, named "file1", and that there
are 445 words and 4827 bytes.

     There are several options with "wc" that allow you to just get the line
count, or just the word count or just the byte count.  Here's an example of
printing out only the line count:

     % wc -l file1
     199 file1

     Another way to use the "ls" command is with the -l option.  -l stands
for "long information".  Here's an example:

     % ls -l main.c
     -rw-------  1 meyer       4827 June  9 12:44 main.c

The number 4827 is the number of bytes. It is possible, by the way, to have
a file that is 0 bytes long.  It still exists, but contains no information
and uses up no disk space for data, although some small amount of disk space
is used to store the name and attributes of the file.  (The "directories"
tutorial explains more about the long output of "ls".)
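
     The two ways of measuring a file can be compared side by side with a
sketch like this one.  It assumes a Bourne-style shell; the file names
"rhyme" and "empty" are made up, and "touch" (not covered in this lesson) is
used to create a file of 0 bytes.

```shell
# Work in a scratch directory so none of your real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# One line of three words: 20 characters plus the newline makes 21 bytes.
printf 'Hickory dickory dock\n' > rhyme

# touch creates a file that exists but holds 0 bytes.
touch empty

# Lines, words and bytes for rhyme: 1, 3 and 21.
wc rhyme

# Byte counts only, for both files.
wc -c rhyme empty

# The size column of the long listing shows the same byte counts.
ls -l rhyme empty
```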


More about File Types
---------------------

     Earlier we alluded to the fact that the data in files has a certain type.
For example, some files contain only printable ASCII text and may have been
created by "vi" or some other editor.  Another file may contain compiled
object code.  Other files contain shell scripts, sound files, or video images.

     Often the name suggests the file's type by using an extension (suffix), 
like in MS-DOS.  For example, "mypgm.c" is probably a C source file.  Unlike 
MS-DOS, UNIX files are not limited to just one extension or an extension of 
just three characters.  UNIX does not care what extensions you have on your 
file names.  The extensions are only important because they allow you and the 
utility programs that you are running to identify files.

Here are some of the more common extensions and what they signify:

  .c    C source program
  .o    object program created by some compiler
  .s    assembler source program
  .h    header file for C or some other language
  .a    archive library, used to bundle together compiled subroutines
  .Z    compressed file
  .f    FORTRAN source program
  .p    Pascal source program

But many of the extensions that you find are JUST conventions that are not 
enforced by the compiler and exist only to help the user.  Here are some of
them:

  .tar  archive library produced by "tar" command
  .a    Ada source program
  .ada  Ada source program
  .l    LISP source program

Some file types are printable and some are unprintable, meaning that the bytes
they contain cannot be displayed on a printer or a screen and may in fact
"screw up" the printer or screen.  If you "cat" one of these files to the
screen, bad things may happen.  (Usually the result is nothing more disastrous
than temporarily locking up your terminal.  You may have to turn off your
terminal and log in again.)

     To avoid screwing up your terminal but still determine what type of data
is in a file, the "file" command can be used.  It will make a guess as to what
type of data is in the file, but it can be fooled.  For example, if the file
contains code in some programming language that it does not recognize, the
"file" command may just report that the file contains "English text".

     % file xyz

If the file type contains "executable" in it, then the file is most likely a
compiled program and is unprintable.
There are several different types of executable files on the SUN UNIX system.
For example, one is

     sparc demand paged dynamically linked executable not stripped

Believe it or not, each of these terms has a definite meaning and is not there
just to confuse or irritate you!
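
     A quick way to get a feel for the "file" command is to try it on files
whose contents you already know.  The sketch below assumes a Bourne-style
shell and uses made-up file names.  The exact wording of the guesses differs
from one UNIX system to another, so no particular output is promised here.

```shell
# Work in a scratch directory so none of your real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# One file of plain printable text, and one file with nothing in it.
printf 'Hickory dickory dock\n' > rhyme
touch empty

# Ask for a guess about each file; several names may be given at once.
file rhyme empty
```

Because "file" only inspects the contents and never displays them, it is safe
to use on a file that would garble your screen if you used "cat" on it.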


Creating files
--------------

     Some files are created by humans, such as source programs.  These files
are usually created using editors such as "vi".  Other files are created by 
programs when they run, usually as output files.  These files can be
subsequently changed by using an editor, if they are printable.

     Though the "vi" editor is explained in detail in a later lesson, here we
want to introduce a very simple way to create a printable file of text.  You can
use this method whenever you want to, even when later you know how to use "vi".

     First, let's demonstrate it and then explain it.  Using the "cat" command,
we are going to have UNIX redirect what we type in at the keyboard into a file
called "my.newfile".

     % cat > my.newfile
     Hickory dickory dock
     the mouse ran up the clock

When we are done, we press CONTROL-D, and we will get the shell prompt again.  
CONTROL-D is a general end of input signal in UNIX.

     This works because UNIX allows us to alter the direction of our data.
The cat command specifies no filename, so the input is taken by default from
the terminal.  Normally, whatever is typed in at the terminal would simply be
echoed back to the screen, which is not terribly useful in itself.  But if we
use the > sign to tell UNIX to redirect the output into a file called
"my.newfile" then we accomplish our purpose.  Be careful: if "my.newfile"
already exists, its old contents are normally replaced.
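
     Redirection works with any command that prints something, not just with
cat.  The sketch below is one possible variation, assuming a Bourne-style
shell.  It uses "echo", which simply prints its arguments, and the >> sign,
which appends to a file.  Neither is covered in this lesson, so treat them as
a preview.

```shell
# Work in a scratch directory so none of your real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# > sends the output into a file instead of the screen, creating the
# file (or replacing its old contents if it already exists).
echo "Hickory dickory dock" > my.newfile

# >> adds to the end of the file rather than replacing what is there.
echo "the mouse ran up the clock" >> my.newfile

# Show the result: both lines of the rhyme.
cat my.newfile
```

Since nothing has to be typed at the keyboard, CONTROL-D is not needed here.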


Deleting and Renaming Files
---------------------------

     To remove those unwanted files, use rm.  For example,

     % rm syllabus.dit

This command is often called "delete" or "erase" in other systems.
But WATCH OUT!!!  UNIX is unforgiving and does not let you "un-remove" a file.
What is gone is gone!

     Sometimes files must be renamed.  UNIX does not have a rename command,
but another command, called "mv", which stands for "move", is used to rename
files.  The new name follows the old name.  In the following example, an
existing file "xyz" will be renamed to "abc":

     % mv xyz abc

Move does other things, like placing files into different directories.  This
will be discussed later.
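
     Renaming and removing can be practiced without risk in a scratch
directory, as in the sketch below.  It assumes a Bourne-style shell and uses
"touch" (not covered in this lesson) to create a throwaway file.  If you are
nervous about rm, it also accepts a -i option that asks for confirmation
before each file is removed.

```shell
# Work in a scratch directory so none of your real files are touched.
scratch=$(mktemp -d)
cd "$scratch"

# Create an empty file to practice on.
touch xyz

# Rename it: the name "xyz" disappears and "abc" takes its place.
mv xyz abc
ls

# Remove it for good; there is no way to get it back.
rm abc
ls
```

The first "ls" lists only "abc" and the second lists nothing at all.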