Why does dar save UID/GID instead of plain user names and group names?
A file's properties do not contain the name of its owner or of its
owning group. They contain two numbers instead: the user ID and the
group ID (UID and GID for short). The /etc/passwd file (and /etc/group
for groups) associates these numbers with names and other properties,
such as the login shell, the home directory and the password (see also
/etc/shadow).
Thus, when you list a directory (with the 'ls' command or with any GUI
program, for example), the listing application opens the directory and
finds there a list of names, each with an associated inode number. It
then fetches the inode attributes of each file and reads, among other
information, the UID and the GID. To display the real user name and
group name, the listing application calls a standard C library function
that does the lookup in /etc/passwd, possibly in NIS if it is
configured, and in any other additional system. This way applications
do not have to bother with the many possible system configurations: the
same API is used whatever the system. The lookup returns the name if it
exists, and the listing application displays, for each file found in
the directory, its attributes together with the user name and group
name returned by the system.
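The difference between what the inode stores and what the lookup returns
can be observed from a shell. This is only a sketch, assuming a
Linux-like system where 'stat -c' (GNU coreutils syntax) and 'getent'
are available; the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: the filesystem stores numbers; names come from a separate lookup.
# Assumes GNU 'stat -c' and 'getent' are available (typical on Linux).
f=$(mktemp)

# Numeric owner and group, as recorded in the inode.
file_uid=$(stat -c '%u' "$f")
file_gid=$(stat -c '%g' "$f")
echo "inode says: uid=$file_uid gid=$file_gid"

# Name lookup through the system databases (/etc/passwd, NIS, ...).
# The lookup may return nothing: a UID does not need to have a name.
owner=$(getent passwd "$file_uid" 2>/dev/null | cut -d: -f1)
echo "lookup says: ${owner:-no name for this UID}"

# Compare the two views of the same file.
ls -ln "$f"   # numeric IDs, as stored
ls -l "$f"    # names, as resolved by the system
rm -f "$f"
```

If the UID has no entry in the system databases, 'ls -l' falls back to
printing the number, which is exactly the situation described above.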
As you can see, the user name and group name are not part of any file
attribute, but the UID and GID *are*. Dar is mainly a backup tool: it
preserves file properties as much as possible, to be able to restore
files as close as possible to their original state. Thus a file saved
with UID=3 will be restored with UID=3. The name corresponding to UID 3
may not exist, may exist and be the same, or may exist and be
different: the file is restored with UID 3 in any case.
Scenario with dar's way of restoring
Thus, when you back up and restore a crashed system, you can be
confident that the restoration will not interfere with the bootable
system you used to launch dar and restore your disk. Assume UID 1 is
labeled 'bin' in your real, crashed system, while in the boot system
UID 1 is labeled 'admin' and UID 2 is labeled 'bin'. Files owned by
'bin' in the system to restore will be restored under UID 1, not under
UID 2, which is the one the temporary boot system uses for 'bin'. At
that time, after the restoration and still running from the boot
system, an 'ls' will show the files originally owned by 'bin' as owned
by user 'admin'.
This is really a mirage: your restoration also restores the /etc/passwd
file and the other system configuration files (like the NIS
configuration files if they were used). At reboot time on the newly
restored real system, UID 1 will be associated with user 'bin' again,
and the files originally owned by user 'bin' will be listed as owned by
'bin', as expected.
Scenario with plain name way of restoring
If dar had done otherwise and restored the files owned by 'bin' to the
UID corresponding to 'bin', these files would have been given UID 2
(the one used by the temporary bootable system used to launch dar). But
once the real restored system had been booted, UID 2 would have become
some other user and not 'bin', which is mapped to UID 1 in the restored
/etc/passwd.
Now, if you want to change some UID/GID when moving a set of files from
one live system to another, there is no problem if you do not run the
restoration under the 'root' account. Accounts other than 'root' are
usually not allowed to modify UID/GID, thus files restored by dar will
have the user and group ownership of the dar process, that is, of the
user who launched dar.
But if you really need to move a directory tree containing files with
different ownerships, and you want to preserve these different
ownerships from one live system to another while the corresponding
UID/GID do not match between the two systems, dar can still help you:
- Save your directory tree on the source live system
- From the root account on the destination live system, do the
following:
  - restore the archive in an empty directory
  - change the UID and GID of the files according to the ones used by
    the destination filesystem with the commands:
    find /path/to/restored/archive -uid <old UID> -print -exec chown <new name> {} \;
    find /path/to/restored/archive -gid <old GID> -print -exec chgrp <new name> {} \;
The first command remaps a UID to another for all files under the
/path/to/restored/archive directory.
The second command remaps a GID to another for all files under the
/path/to/restored/archive directory.
Example of how to globally modify ownership of a directory tree user by user
For example, the source system has three users: Pierre (UID 101), Paul
(UID 102) and Jacques (UID 100), but on the destination system these
same users are mapped to different UIDs: Pierre has UID 100, Paul has
UID 101 and Jacques has UID 102.
We temporarily need an unused UID on the destination system; we will
assume UID 680 is not used. Then, after restoring the archive in the
directory /tmp/A, we do the following:
find /tmp/A -uid 100 -print -exec chown 680 {} \;
find /tmp/A -uid 101 -print -exec chown pierre {} \;
find /tmp/A -uid 102 -print -exec chown paul {} \;
find /tmp/A -uid 680 -print -exec chown jacques {} \;
which means:
- change files of UID 100 to UID 680 (the files of Jacques are now under
the temporary UID 680 and UID 100 is now free)
- change files of UID 101 to UID 100 (the files of Pierre get their UID
on the destination live system, UID 101 is now free)
- change files of UID 102 to UID 101 (the files of Paul get their UID on
the destination live system, UID 102 is now free)
- change files of UID 680 to UID 102 (the files of Jacques, which had
been temporarily moved to UID 680, are now set to their UID on the
destination live system, UID 680 is no longer used).
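The four steps above can be wrapped in a small helper with a preview
mode, so that the list of affected files can be checked before anything
is changed. This is a sketch only: remap_uid and DRY_RUN are hypothetical
names, not part of dar, and the use of 'chown -h' (change a symbolic
link itself rather than its target) is an addition to the commands shown
above.

```shell
#!/bin/sh
# Sketch: remap ownership under a restored tree, one UID at a time.
# remap_uid is a hypothetical helper, not part of dar.
# Usage: remap_uid ROOT OLD_UID NEW_OWNER
# With DRY_RUN=1 it only prints the chown commands it would run.
remap_uid() {
    root=$1; old_uid=$2; new_owner=$3
    if [ "${DRY_RUN:-0}" = 1 ]; then
        find "$root" -uid "$old_uid" -exec echo chown -h "$new_owner" {} \;
    else
        # -h changes a symbolic link itself instead of the file it points to
        find "$root" -uid "$old_uid" -exec chown -h "$new_owner" {} +
    fi
}

# Preview on a scratch tree owned by the current user; nothing is modified.
scratch=$(mktemp -d)
touch "$scratch/file"
DRY_RUN=1
preview=$(remap_uid "$scratch" "$(id -u)" 680)
echo "$preview"

# Real run as root, mirroring the four steps above:
#   DRY_RUN=0
#   remap_uid /tmp/A 100 680
#   remap_uid /tmp/A 101 pierre
#   remap_uid /tmp/A 102 paul
#   remap_uid /tmp/A 680 jacques
```

The order of the calls matters: the temporary UID must be filled first
and emptied last, otherwise files of two users end up merged under the
same UID.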
You can then move the modified files to the appropriate destination, or
make a new dar archive to be restored in the appropriate place if you
want to use some of dar's features, for example restoring only the
files that are more recent than those present on the filesystem.
Dar_manager does not accept encrypted archives, how to work around this?
Yes, that's true, dar_manager does not accept encrypted archives. The
first reason is that, as long as the dar_manager database cannot itself
be encrypted, it would not be sound to add encrypted archives to it.
The second reason is that the dar_manager database would have to hold
the key of each encrypted archive, making this database the weakest
point in your data security: breaking the database encryption would
give access to every encryption key and, with access to the original
archives, to the data of any archive added to the database.
There is however a feature in the pipeline to let dar_manager encrypt
its database, then another feature to let dar_manager store the
different archive keys, then another one to have the keys passed from
dar_manager to dar outside of the command line (which would expose the
keys to the sight of other users on a multi-user system), then yet
another one to feed the database with the archive keys, also without
using the command line. Well, there are a lot of features to add and
test before you can expect to find this in a released version of dar.
In the meantime, you can proceed as follows:
- isolate your encrypted archive into an unencrypted 'extracted
catalogue': do not use the -K option while isolating; you will however
need the -J option to let dar read the encrypted archive. Note that,
still for key protection, you are encouraged to use a DCF file (Dar
Command File, a plain file with a list of options to be passed to dar)
with restricted permissions, containing the '-J <key>' option to be
passed to dar. Dar's -B option would then receive this filename. This
prevents other users of your system from reading the key you used for
your archives,
- add these extracted catalogues to the dar_manager database of your
choice,
- change the name and path of the added catalogues to point to your
real encrypted archives (-b and -p options of dar_manager).
Note that since the database is not encrypted, this exposes the file
listing (not the files' contents) of your encrypted archives to anyone
able to read the database; it is thus recommended to set restrictive
permissions on this database file.
When the time comes to use dar_manager to restore some files, you will
have to make dar_manager pass the key to dar so that it can restore the
needed files from the archive. This can be done in several ways:
dar_manager's command line, the dar_manager database, or a DCF file.
- dar_manager's command line: simply pass -e "-K <key>" to dar_manager.
Note that this exposes the key twice: on dar_manager's command line and
on dar's command line.
- dar_manager database: the database can store some constant options to
be passed to dar. This is done using the -o option or the -i option.
The -o option exposes the arguments you want passed to dar, because
they appear on dar_manager's command line. The -i option lets you do
the same thing interactively, which is a better choice. However, even
though the -i option is a safe way to feed the dar_manager database
with the '-K <key>' option, dar will still receive this option on its
command line. Thus the key will still be visible to other users of the
same system.
- The last and best way is to use a DCF file with restrictive
permissions. This file holds the '-K <key>' option that dar needs to
read the encrypted archives, and dar_manager asks dar to read this file
thanks to the '-B <filename>' option that you give either on
dar_manager's command line (-e -B <filename> ...) or in the options
stored in the database (-o -B <filename>).
Note that you must prevent other users from reading any file holding
the archive key: this covers the dar_manager database as well as the
DCF files you could temporarily use. Second note: this workaround
assumes that all encrypted archives share the same key.
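Creating such a DCF file so that it is never readable by other users,
even briefly, can be sketched as follows. make_dcf is a hypothetical
helper, not a dar command, and '<key>' is kept as a literal placeholder
for your real key.

```shell
#!/bin/sh
# Sketch: write a DCF file that only its owner can read.
# make_dcf is a hypothetical helper, not a dar command.
# Usage: make_dcf FILE OPTION VALUE     e.g. make_dcf keys.dcf -K "<key>"
make_dcf() {
    dcf=$1; opt=$2; value=$3
    (
        umask 077            # files created from here on get no group/other access
        : > "$dcf"           # create the file, or empty it if it already exists
        chmod 600 "$dcf"     # tighten permissions in case the file already existed
        printf '%s %s\n' "$opt" "$value" >> "$dcf"
    )
}

# Example with a throw-away path and a placeholder key.
demo_dcf=$(mktemp)
make_dcf "$demo_dcf" -K "<key>"
ls -l "$demo_dcf"

# dar is then pointed at this file with -B, for instance through
# dar_manager as described above:  dar_manager ... -e "-B $demo_dcf" ...
```

The file is emptied and restricted before the key is written, so the key
never sits in a file with wider permissions.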
How to overcome the lack of static linking on Mac OS X?
The answer comes from Dave Vasilevsky in an email to the dar-support
mailing-list. I will let him explain how to proceed:
Pure-static executables aren't used on OS X.
However, Mac OS X does have other ways to build portable binaries.
HOWTO build portable binaries on OS X?
First, you have to make sure that dar only uses operating-system
libraries that exist on the oldest version of OS X that you care about.
You do this by specifying one of Apple's SDKs, for example:
export CPPFLAGS="-isysroot /Developer/SDKs/MacOSX10.2.8.sdk"
export LDFLAGS="-Wl,-syslibroot,/Developer/SDKs/MacOSX10.2.8.sdk"
Second, you have to make sure that any non-system libraries that dar
links to are linked in statically. To do this edit
dar/src/dar_suite/Makefile, changing LDADD to
'../libdar/.libs/libdar.a'. If any other non-system libs are used (such
as gettext), change the makefiles so they are also linked in
statically.
Apple should really give us a way to force the linker to do this
automatically!
Some caveats:
* If you build for 10.3 or lower, you will not get EA support, and
therefore you will not be able to save special Mac information like
resource forks.
* To work on both ppc and x86 Macs, you need to build a universal
binary. For instructions, use Google.
* To make a 10.2-compatible binary, you must build with GCC 3.3.
* These instructions won't work for the 10.1 SDK, that one is harder to
use.
Why can't dar use the full power of my multi-processor computer?
Parallel programming is a science in itself. Having specialized in
that area during my studies, I can briefly explain the constraints
here. A program can use several processors if the algorithm it uses can
be parallelized. Such an algorithm can be cut, either statically (at
programming time) or dynamically (at execution time), into several
independent execution threads. These threads must be as autonomous as
possible if you don't want one thread waiting for another. The
constraint is this: if you cannot get threads with no or very little
communication and dependence between them, parallelization is not
worth it.
Back to dar. From a very abstract point of view, dar works by fetching
files from the filesystem and appending their data to a single file
(the archive). For each file, dar records in memory the location of the
data, and once all files have been processed, this location information
(contained in the so-called "catalogue") is added at the end of the
archive.
One could think of parallelizing the file processing: instead of
proceeding file by file, let's process all files at the same time (or
rather, say, N files at the same time). OK, but first you would have a
significant loss of performance at disk level, as the disk heads would
spend most of their time seeking from the data of one of the N files to
the data of another. Second, to add a file to the archive you must know
the position of the end of the last added file, which cannot be known
in advance because of compression and/or encryption. Thus a given
thread would have to wait for another to finish before it could in turn
write the data of the file it owns... As you can guess, parallelizing
this way would give worse performance than the sequential algorithm.
Another possibility is to have several threads, each doing one of:
- file lookup (report which files are present on the filesystem)
- file filtering (determine which files to save, which files to
compress, and so on)
- file compression
- file encryption
This would be a bit better, but file lookup is very fast and does not
consume much CPU, and neither does file filtering. File compression and
file encryption, on the other hand, are very CPU intensive. Thus,
first, if you only use compression OR encryption, parallelizing this
way will not bring you much extra power, as the encryption or the
compression itself cannot be parallelized: you will get roughly the
same execution time as with sequential execution. Second, if you use
neither compression nor encryption, your CPU will stay idle most of the
time and dar's execution time will only depend on the speed of your
hard disk, so you will not get any improvement here. Last, only if you
use both encryption and compression could you gain some performance
from parallelization, but dar could use at most two CPUs, no more! And
the speed-up will be less than 2 (not twice as fast, but much less),
because for a given amount of data compression needs much more time
than encryption. Thus the encryption thread would spend most of its
time waiting for compressed data.
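To put a rough figure on this "less than 2" bound, here is a simple
model, not a measurement. The symbols are introduced here for
illustration only: assume that compressing a given amount of data takes
a time t_c and encrypting it takes t_e, with t_e <= t_c, and ignore I/O
and synchronization costs. A sequential run takes about t_c + t_e, while
an ideal two-stage pipeline is limited by its slowest stage, so the
speed-up S is:

```latex
\[
  S \;=\; \frac{t_c + t_e}{\max(t_c,\, t_e)}
    \;=\; 1 + \frac{t_e}{t_c} \;\le\; 2
\]
```

For instance, if compression took four times as long as encryption
(t_c = 4 t_e), this ideal speed-up would be only 1.25, and any real
implementation would get less.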
OK, maybe you have thought of another possibility: having N threads for
compression and M threads for encryption. Assuming encryption is faster
than compression, we could choose N > M. We could also have a fixed
value for N and a dynamic value for M, depending on how fast
compression is running. This would let dar compress and encrypt several
files at the same time and, assuming that the data reading and writing
time is negligible compared to the compression time (which remains to
be demonstrated, as several files potentially have to be read at the
same time), we could maybe get a real performance gain. But while
several files can now be compressed at the same time, only one can be
written to the archive at a given time. Thus, between the time the
compression of a file starts and the time it finishes, all other
threads have to keep their compressed data in memory. Then the next
thread can write its data to the archive while all the others keep
compressing to memory (RAM). We will quickly run out of RAM! Either
your computer will start to swap, or you have to store the data back to
disk in a temporary file, which will have to be read again and written
to the archive. Doing so degrades disk performance, as the disk will
serve for reading the file's data, writing its compressed data to a
temporary file, reading back its compressed data, and writing its
compressed data to the archive.
Last, when using parallelization there is always a cost due to
inter-process communication and concurrent I/O operations on the
hardware (here, hard disks are used at the same time to read the files
to back up and to write them into the archive). This cost becomes
negligible when the number of parallel threads increases, assuming all
threads are kept busy... but here there is a bottleneck, the archive
creation, which seems to prevent any really impressive parallelization.
In conclusion, unless you can find another way to parallelize dar, a
parallelized version of dar will not bring a noticeable improvement.
Parallelization is strongly related to the algorithm used: some
algorithms are well suited to it, others are not.
Is libdar thread-safe, and in which sense?
libdar is the part of dar's
source code that has been rewritten to be used by external programs
(like kdar). It has been modified to be used in a multi-threaded
environment, thus, *yes*, libdar is thread-safe.
However, thread-safe does not mean that you do not have to take some
precautions in your programs while using libdar (or any other library).
Let's take an example and consider a simple library that provides two
functions, both receiving the address of an integer as argument. The
first increments the given integer until the user presses a specific
key, while the second decrements the given integer until the user
presses another key. This library is thread-safe in the sense that it
has no static variable and no state of its own at any particular time.
It is just a set of two functions.
Now, your multi-threaded program is the following: at a given time you
have one thread running the first library function while another runs
the other library function. All will work fine unless you gave both
threads the same integer. One thread would then increment it while the
other would decrement it, and you would not get the behavior you would
expect outside a multi-threaded environment. The problem would be the
same if, instead of using an external library, you were accessing this
same integer from two different threads at the same time.
Care must thus be taken that two different threads do not act on the
same variables at the same time. Sharing a variable is however possible
with POSIX mutexes, which define a portion of code (known as a critical
section) that a thread cannot enter while another one is in it (such a
thread is suspended until the other thread exits the critical section).
For libdar, this is the same: you must take care not to have two or
more different threads acting on the same data. Libdar provides a set
of classes, which can be seen as a set of types (like a C struct) with
associated functions (known as methods in the object-oriented world).
From these classes, your program will create objects: each object *is*
a variable. Technically, invoking a method on an object is exactly the
same as invoking a function and giving it a pointer to the object as a
hidden argument; semantically, invoking a method is a way to read or
modify this variable (= the object). Thus, if you plan to act on a
given object from several threads at the same time, you must use POSIX
mutexes or any other means to mutually exclude access to this object
between all your threads, so that only one thread may read or modify
this variable (= this object) at a given time.
Note that internally libdar uses some static variables. By static
variables, I mean variables that exist even when no thread is running a
libdar function or method. Accesses to these variables are enclosed in
critical sections so that libdar's users can use the library normally.
In other words, this is transparent to you. For example, the mechanism
to cancel a libdar call uses an array listing the tids (thread ids) of
the threads whose running call must be canceled: if you wish to cancel
a libdar call run by thread 10, another thread adds tid 10 to this
list. At regular checkpoints, all libdar functions check whether this
list contains the tid the call is run from. If it does, the call aborts
and returns, and the thread can continue its execution out of libdar
code. As you see, several threads may read or write this array of tids
at the same time. Thanks to a set of mutexes this is transparent to
you, and for this reason libdar can be said to be thread-safe.