The Chiron FS HOWTO

Edited by

Luis Otávio de Colla Furquim

Contact

Table of Contents

Introduction

1. How to Install and Test Chiron FS

a) Obtaining and Installing Chiron FS

b) Testing Chiron FS

2. Using Chiron FS

3. Putting in fstab

4. Some Examples

a) File and web servers

i) Network storage

ii) Local storage

iii) Local and network storage

b) Desktop with network backup

5. Replica failures

6. Hash, file descriptors and ulimit considerations

7. Support

Introduction

This software is licensed under the terms of GPLv3.

COVERED CODE IS PROVIDED UNDER THIS LICENSE ON AN "AS IS" BASIS, WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, WITHOUT LIMITATION, WARRANTIES THAT THE COVERED CODE IS FREE OF DEFECTS, MERCHANTABLE, FIT FOR A PARTICULAR PURPOSE OR NON-INFRINGING. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE COVERED CODE IS WITH YOU. SHOULD ANY COVERED CODE PROVE DEFECTIVE IN ANY RESPECT, YOU (NOT THE INITIAL DEVELOPER OR ANY OTHER CONTRIBUTOR) ASSUME THE COST OF ANY NECESSARY SERVICING, REPAIR OR CORRECTION. THIS DISCLAIMER OF WARRANTY CONSTITUTES AN ESSENTIAL PART OF THIS LICENSE. NO USE OF ANY COVERED CODE IS AUTHORIZED HEREUNDER EXCEPT UNDER THIS DISCLAIMER. IN THE EVENT THAT YOUR LOCAL LAW MAKES THIS STATEMENT OR THE LICENSE PARTIALLY OR TOTALLY INAPPLICABLE, THEN YOU HAVE NO AUTHORIZATION TO USE IT AND INSISTING ON SUCH USE IS ILLEGAL!

Use at your own risk!

Chapter 1. How to Install and Test Chiron FS

a) Obtaining and Installing Chiron FS

You may download ChironFS from Google Code. Several packages are available there:

For the tarball, named chironfs-<current_version>.tar.gz, make sure that your system has FUSE installed, BOTH BINARY and DEVELOPMENT files. On a recent distribution it is probably already installed, or can be installed with a package manager such as apt-get, yum or yast. If you don't have it, you may download it from its site.

Unpack it wherever you want and you will have a directory named chironfs-<current_version>.

Just enter the directory and compile it with the commands:

cd /path/to/source/chironfs-<current_version>

./configure [any configure options you want]

make

make install

For the source in RPM format, just type:

rpmbuild --rebuild chironfs-<current_version>.src.rpm

and this will compile ChironFS, generating a chironfs-<current_version>-<your_architecture>.rpm. Then you may install it with this command:

rpm -i chironfs-<current_version>-<your_architecture>.rpm

For the binaries in RPM format, just type:

rpm -i chironfs-<current_version>-<architecture>.rpm

For the binaries in DEB format, just type:

dpkg -i chironfs_<current_version>_<architecture>.deb



b) Testing Chiron FS



Create 3 directories, one for the virtual filesystem and two for the replicas, and mount the filesystem with the commands:

mkdir /virtual /real1 /real2

chironfs /real1=/real2 /virtual

Confirm that it is correctly mounted with the command:


df

You should see a line like this:

/real1=/real2   138456396 130613500    809640 100% /virtual



Try to create/modify/delete some files in /virtual and check that the changes are reflected in /real1 and /real2.
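The manual check above can also be scripted. The following is only a minimal sketch, assuming the test mount of /virtual over /real1 and /real2 is in place; the function name check_replication and the probe file name are made up for this illustration and are not part of ChironFS.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with ChironFS): write a probe file through
# the virtual mount point and verify that every replica directory received it.
# Usage: check_replication VIRTUAL_DIR REPLICA_DIR...
check_replication() {
    cr_virtual=$1
    shift
    cr_probe=".chironfs_probe_$$"   # arbitrary probe file name
    cr_status=0

    echo "probe $$" > "$cr_virtual/$cr_probe" || return 1

    for cr_replica in "$@"; do
        # cmp -s is silent and only reports through its exit status
        if cmp -s "$cr_virtual/$cr_probe" "$cr_replica/$cr_probe" 2>/dev/null; then
            echo "OK: $cr_replica"
        else
            echo "MISSING: $cr_replica"
            cr_status=1
        fi
    done

    rm -f "$cr_virtual/$cr_probe"
    return "$cr_status"
}

# Example, with the test mount above in place:
#   check_replication /virtual /real1 /real2
```

The function returns a non-zero status if any replica lacks the probe file, so it can be used in a larger test script.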


At this point you may be convinced that you have everything you need to start replicating your files. So unmount it with the command:


fusermount -u /virtual

Chapter 2. Using Chiron FS

In the test above, we made the simplest possible use of Chiron FS, but we did not use 2 options that, despite being OPTIONAL, I consider essential in a production environment.


First, in a corporate environment, we want to share access among many users. But FUSE was implemented with per-user defaults, so only the user who mounted the virtual filesystem has the needed visibility; others just see the empty mount point. So, to allow other users, we MUST use the FUSE option allow_other.


Second, we want to be notified if a replica fails. The system will continue running, but we don't want to discover the failures only when the last replica fails and the system stops. We want to get the failed replica up and running as soon as possible! So we need to activate the logs!


So, the example above gets better if we change the mount command to this:

chironfs --fuseoptions allow_other --log /var/log/chironfs.log /real1=/real2 /virtual

If you want more replicas, 3 for example, type:

chironfs --fuseoptions allow_other --log /var/log/chironfs.log /real1=/real2=/real3 /virtual

Now suppose, in this example, that /real2 is known to be a slow filesystem, much slower than /real1 and /real3. When ChironFS has to write data, it must write to all of them. But when it has to read, it chooses just one replica to read from. So we don't want ChironFS to give /real2 an equal chance of being read. As ChironFS uses a round robin algorithm, if we don't tell it that /real2 is slow, it will read from /real2 on every third read. To avoid this, give a low priority to /real2, and ChironFS will read from it only if /real1 AND /real3 fail. To do this, type a colon just before the /real2 path:

chironfs --fuseoptions allow_other --log /var/log/chironfs.log /real1=:/real2=/real3 /virtual
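The read policy described above can be illustrated with a toy selection function. This is only a sketch of the behaviour as this chapter describes it, not ChironFS's actual code. The name pick_read_replica and its read counter argument are hypothetical, and the function assumes that the spec lists only replicas that are currently working.

```shell
#!/bin/sh
# Toy model of the read policy (an illustration, not ChironFS source code).
# Usage: pick_read_replica READ_COUNTER SPEC
#   SPEC uses the command line syntax, e.g. "/real1=:/real2=/real3";
#   a leading colon marks a low priority replica.
pick_read_replica() {
    pr_counter=$1
    pr_high=""
    pr_low=""

    # Split the spec on "=" and sort the replicas into two groups
    pr_old_ifs=$IFS
    IFS='='
    for pr_r in $2; do
        case $pr_r in
            :*) pr_low="$pr_low ${pr_r#:}" ;;
            *)  pr_high="$pr_high $pr_r" ;;
        esac
    done
    IFS=$pr_old_ifs

    # Rotate over the normal priority group; fall back to the low
    # priority group only when no normal priority replica is left
    set -- $pr_high
    if [ $# -eq 0 ]; then
        set -- $pr_low
    fi
    if [ $# -eq 0 ]; then
        return 1
    fi

    shift $(( pr_counter % $# ))
    echo "$1"
}

# Three consecutive reads never touch the slow /real2
for n in 0 1 2; do
    pick_read_replica "$n" "/real1=:/real2=/real3"
done
```

With the spec from the command above, the reads alternate between /real1 and /real3; /real2 is chosen only when it is the sole replica left in the spec.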





Chapter 3. Putting in fstab

To mount it automatically at boot time, simply add lines to your /etc/fstab file as shown in the example below. The first column must start with "chironfs#" followed, without spaces, by the list of replicas separated by equal signs. The second column is your mount point. The third column must be "fuse". The fourth column holds the fuse and chironfs options separated by commas. The last two columns are the standard fstab dump and pass fields and work just as described in the fstab documentation.



chironfs#/real1=/real2 /virtual fuse allow_other,log=/var/log/chironfs.log 0 0
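If you generate fstab entries from a script, the column rules above can be captured in a small function. This is a minimal sketch; fstab_line is a hypothetical helper name, and it only prints the line so that you can review it before adding it to /etc/fstab yourself.

```shell
#!/bin/sh
# Hypothetical helper: print a ChironFS fstab line from its parts.
# Usage: fstab_line REPLICA_LIST MOUNT_POINT OPTIONS
#   REPLICA_LIST  replicas joined by "=", e.g. /real1=/real2
#   OPTIONS       fuse and chironfs options joined by commas
fstab_line() {
    # Columns are separated by whitespace, so none is allowed inside them
    case "$1$2" in
        *" "*)
            echo "replica list and mount point must not contain spaces" >&2
            return 1
            ;;
    esac
    # The last two columns are the dump and pass fields, 0 0 as in the example
    printf 'chironfs#%s %s fuse %s 0 0\n' "$1" "$2" "$3"
}

fstab_line /real1=/real2 /virtual allow_other,log=/var/log/chironfs.log
```

The call at the end prints the same line as the example above.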



Chapter 4. Some Examples



So let's see some real world examples:

a) File and web servers



In this example we will see 3 different solutions for replicating data on a LAN with 2 servers, typically a file server and a web server.



i) Network storage

In this solution, we will need 2 more servers. They will be the servers' servers, sharing their filesystems using any protocol like NFS, SSHFS, etc. For the purpose of the example, we will use NFS. The servers' servers will have the hostnames nfs1 and nfs2 and will export the directory /data. The file and web servers will have the hostnames fileserver and webserver, respectively, and will be the users' servers.



Make the local mount points for all the real and virtual filesystems on each of the users' servers:

mkdir /real1 /real2 /virtual



Update their fstabs by adding these 3 lines:



nfs1:/data /real1 nfs auto 0 0

nfs2:/data /real2 nfs auto 0 0

chironfs#/real1=/real2 /virtual fuse allow_other,log=/var/log/chironfs.log 0 0



Now you may set up your users' servers. Follow the specific instructions of the services you want to set up, taking care to make the fileserver export the /virtual directory to your users and to put the webserver's document trees below the /virtual directory.



To avoid a single point of failure, you should provide redundant network hardware, like 2 NICs for each server, connected to redundant bridges.



Finally, set up heartbeat on the users' servers so that, if one of them fails, another one takes over automatically.



ii) Local storage

In this solution, we will make the users' servers be the servers' servers too, sharing their filesystems using any protocol like NFS, SSHFS, etc. For the purpose of the example, we will use NFS again. The file and web servers will have the hostnames fileserver and webserver, respectively, and will be the users' servers.



Make the local mount points for all the real and virtual filesystems on each of the users' servers:

mkdir /real1 /real2 /virtual



Update the fstab on the fileserver by adding these 2 lines:



webserver:/real1 /real1 nfs auto 0 0

chironfs#/real2=:/real1 /virtual fuse allow_other,log=/var/log/chironfs.log 0 0



Update the fstab on the webserver by adding these 2 lines:



fileserver:/real2 /real2 nfs auto 0 0

chironfs#/real1=:/real2 /virtual fuse allow_other,log=/var/log/chironfs.log 0 0



Now you may set up your users' servers. Follow the specific instructions of the services you want to set up, taking care to make the fileserver export the /virtual directory to your users and to put the webserver's document trees below the /virtual directory.



To avoid a single point of failure, you should provide redundant network hardware, like 2 NICs for each server, connected to redundant bridges.



Finally, set up heartbeat on the users' servers so that, if one of them fails, another one takes over automatically.



iii) Local and network storage



You can combine the above 2 solutions in a third solution, having local and network storage.



Make the local mount points for all the real and virtual filesystems on each of the users' servers:

mkdir /real1 /real2 /real3 /virtual



Update their fstabs by adding these 3 lines:



nfs1:/data /real1 nfs auto 0 0

nfs2:/data /real2 nfs auto 0 0

chironfs#/real3=:/real2=:/real1 /virtual fuse allow_other,log=/var/log/chironfs.log 0 0



Now you may set up your users' servers. Follow the specific instructions of the services you want to set up, taking care to make the fileserver export the /virtual directory to your users and to put the webserver's document trees below the /virtual directory.



To avoid a single point of failure, you should provide redundant network hardware, like 2 NICs for each server, connected to redundant bridges.



Finally, set up heartbeat on the users' servers so that, if one of them fails, another one takes over automatically.



b) Desktop with network backup



If you have a Linux desktop you may want to keep a backup of your local files on some network server. This solution makes instant backups of your files to the network. BUT BE AWARE! THIS IS NOT A REPLACEMENT FOR A REAL BACKUP SOLUTION. ANY CHANGE YOU MAKE TO YOUR FILES (DELETES AND UPDATES) WILL BE REFLECTED IN THE NETWORK REPLICA, SO YOU STILL HAVE TO MAKE YOUR REGULAR BACKUPS! THIS SOLUTION IS JUST A WAY TO NOT LOSE YOUR RECENT WORK WHICH IS NOT FOUND IN THE BACKUP JUST BECAUSE YOU DIDN'T RUN YOUR BACKUP PROCEDURE YET!



Make the local mount points for the real and virtual filesystems on your desktop:

mkdir /real1 /real2 /virtual



Update your fstab by adding these 2 lines:



nfs1:/data /real1 nfs auto 0 0

chironfs#:/real1=/real2 /virtual fuse allow_other,log=/var/log/chironfs.log 0 0



If the software you want to use with ChironFS has specific instructions about raising the limits on open files, or if you simply know that you will have to open a high number of files simultaneously, you MUST read Chapter 6. Hash, file descriptors and ulimit considerations.



Chapter 5. Replica Failures



Any time one of your replicas fails, chironfs attempts to keep your systems running. If the failure occurs during a read operation, chironfs tries to read from another replica and, on success, returns to the caller with no error. In this case, chironfs logs the event if you mounted with the recommended --log option. If the failure occurs during a write operation, chironfs continues trying to write to the other replicas. If at least one of the replicas succeeds in the write, chironfs returns to the caller with no error. But this time, chironfs logs the event AND disables the failed replicas. This means that no further reads or writes will be made from or to them.



At this point, you need a monitoring policy so that you are notified of the event. You may use any software, like logwatch, for this purpose.

Starting with version 1.1, ChironFS offers another way to check your filesystem's health. You now have the option of mounting a proc-like control filesystem using the --ctl <PATH> (or -c <PATH>) command line option. The control filesystem is composed of one directory for each replica. Their names are the complete replica paths with the slashes changed to underscores. Each of them contains two files. The first is named "status" and contains the number "0" for a replica in good state, or the number "2" if the replica is disabled and its data is inconsistent. By echoing "0" or "2" into this file you enable or disable the replica. The second file, named "check_chironfs.sh", is a script intended to run as a nagios plugin. If you want to use it, do not copy it to another path, because its contents change dynamically and it will not work from another path.

So, suppose your fstab is something like:



nfs1:/data /real1 nfs auto 0 0

chironfs#:/real1=/real2 /virtual fuse allow_other,log=/var/log/chironfs.log,ctl=/var/run/chironctl 0 0



If you want to monitor your replicas with nagios, just tell nagios to run the scripts located at "/var/run/chironctl/_real1/check_chironfs.sh" and "/var/run/chironctl/_real2/check_chironfs.sh".
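If you do not use nagios, the "status" files can also be read directly. The sketch below follows the naming rule described above (slashes changed to underscores) and the documented values "0" and "2". The function names ctl_dir_name and replica_state are hypothetical, and the sketch assumes that the status file holds just the number.

```shell
#!/bin/sh
# Hypothetical helpers for reading the ChironFS control filesystem.

# ctl_dir_name REPLICA_PATH
# Prints the directory name used inside the control filesystem.
ctl_dir_name() {
    printf '%s\n' "$1" | tr '/' '_'
}

# replica_state CTL_MOUNT REPLICA_PATH
# Prints "enabled", "disabled" or "unknown"; the exit status is 0, 1 or 2.
replica_state() {
    rs_file="$1/$(ctl_dir_name "$2")/status"
    rs_value=$(cat "$rs_file" 2>/dev/null) || { echo unknown; return 2; }
    case $rs_value in
        0) echo enabled ;;
        2) echo disabled; return 1 ;;
        *) echo unknown; return 2 ;;   # any other content is not documented
    esac
}

# Example, with the fstab above:
#   replica_state /var/run/chironctl /real1
#   replica_state /var/run/chironctl /real2
```

Because the exit status distinguishes the three cases, the function can be called from cron or from any monitoring tool that runs shell commands.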



You cannot change ownership or permission bits on any of the files/directories in this filesystem. To configure who can access it, change the ownership of its mount point BEFORE mounting it. ChironFS will apply this ownership to all files and directories under it.



So, after detecting the failure, correct its cause on the server of the failed replica. YOU MUST RE-ESTABLISH THE CONSISTENCY OF THE DATA ON THE FAILED SERVER BY YOURSELF. I SUGGEST THE USE OF RSYNC TO UPDATE THE DATA. THIS STEP MUST BE DONE BEFORE PUTTING THE REPLICA IN ENABLED STATE AGAIN, OTHERWISE YOU WILL CORRUPT YOUR DATA IN ALL OTHER REPLICAS. To re-establish the use of the replica after the consistency recovery procedure, supposing your fstab is the same as above and the failed (and disabled) replica is /real2, use the control filesystem to enable it again as below:



echo 0 > /var/run/chironctl/_real2/status
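The whole recovery sequence can be summarized by a small function that prints the commands without running them, so that you can review each one first. This is a sketch under assumptions: the name resync_plan is hypothetical, the rsync options shown (-a to preserve attributes, --delete to remove files that no longer exist on the good replica) are one reasonable choice rather than something this HOWTO prescribes, and it assumes that nothing writes to the filesystem while the copy runs.

```shell
#!/bin/sh
# Hypothetical helper: print the recovery commands for a failed replica.
# Usage: resync_plan GOOD_REPLICA FAILED_REPLICA CTL_MOUNT
resync_plan() {
    # Control directory name: replica path with slashes changed to underscores
    rp_ctl_dir=$(printf '%s\n' "$2" | tr '/' '_')

    # Step 1: mirror a healthy replica onto the repaired one
    echo "rsync -a --delete $1/ $2/"

    # Step 2: only after the copy has finished, enable the replica again
    echo "echo 0 > $3/$rp_ctl_dir/status"
}

resync_plan /real1 /real2 /var/run/chironctl
```

For the fstab above, the second printed command is the same enable command shown in this chapter.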



Chapter 6. Hash, file descriptors and ulimit considerations



The hash function currently in use is Robert Jenkins' 32 bit integer hash function, as found in Twang's doc, adapted to the C language. The 64 bit mix function was taken from there too.

In prior versions of ChironFS, the size of the hash table was equal to the file-max kernel parameter. It has been lowered to minimize memory usage. This was done without sacrificing performance, thanks to the good distribution achieved by these hash functions.

The file-max kernel parameter is system wide. Even if every filesystem were ChironFSed, with a minimum of two replicas we would need only 33% of the old table space, because each ChironFS file descriptor uses three system file descriptors (the virtual one plus one per replica). Since the new hash functions are much better distributed and perform well up to 80% table usage, I decided to use an allocation proportional to file-max: the largest power of 2 less than 50% of file-max.

The stats showed a low collision rate at 50% table usage (about 18% of file-max on my box, which means that, using two replicas, I was using 54% of the system file descriptors).



Statistics



I made a hash test program that simulates file descriptor allocations and counts, for each hash slot, the total collisions, either direct (from the hash calculation) or indirect (when, due to a previous collision, the system tried to allocate the next address and found it already allocated).

Each pixel of the figures below represents one address. The white ones are free slots; the green ones are allocated without any collision. The red ones are addresses with just one collision and the black ones are those with more than one collision.

I tested with three tables: small, medium and large. The small one has the size currently in use by ChironFS, the largest power of 2 less than 50% of file-max. The medium is double the size of the small table. The large is 4 times the medium one. The tests were done allocating 50% of the small table, 25% of the medium and 6.25% of the large one. On my system, file-max is 365718. So the small table has 128K file descriptors, the medium 256K and the large 1M. The allocations totaled 64K file descriptors.
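The table sizes quoted above follow directly from the sizing rule. The sketch below recomputes them in shell arithmetic. The function names pow2_below and table_size are made up for this example and do not come from the ChironFS source.

```shell
#!/bin/sh
# Recompute the hash table sizes from file-max (illustration only).

# pow2_below N: largest power of 2 strictly less than N (for N > 1)
pow2_below() {
    pb_p=1
    while [ $(( pb_p * 2 )) -lt "$1" ]; do
        pb_p=$(( pb_p * 2 ))
    done
    echo "$pb_p"
}

# table_size FILE_MAX: largest power of 2 less than 50% of file-max
table_size() {
    pow2_below $(( $1 / 2 ))
}

file_max=365718      # the value from the text; on Linux the current value
                     # can be read from /proc/sys/fs/file-max
small=$(table_size "$file_max")
medium=$(( small * 2 ))
large=$(( medium * 4 ))
echo "small=$small medium=$medium large=$large"
# prints: small=131072 medium=262144 large=1048576
```

These are the 128K, 256K and 1M tables of the test; the 64K (65536) allocations are 50%, 25% and 6.25% of them, respectively.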



Figure 1. Small table allocations.



Figure 2. Medium table allocations.



Figure 3. Large table allocations.

So, there is an acceptable collision rate at a high level of concurrently opened files (64k ChironFS file descriptors, using the minimum of two replicas, actually means 192k (53%) of the 360k file-max limit).

But the discussion doesn't end here. After this explanation of how the file descriptors are managed and pushed to the limits, we shall consider how the system treats all that stuff.

The point is that the file-max kernel parameter is the system wide limit on open file descriptors. But the system also limits the number of open files per process! And ChironFS is subject to these limits too. When you mount ChironFS, it will try to raise system parameters to the values needed. But it will succeed only if it is running as root! If you run ChironFS as an unprivileged user, you will have to raise the maximum open files per process as root, by editing the file /etc/security/limits.conf and making sure it has the lines below to, for example, set the maximum open files to 65536 for user "www":

www soft nofile 65536
www hard nofile 65536

Note that you have to log in again for the changes to take effect. If you are in an X-terminal you have to close your X session, log in again and then open your X-terminal.
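After logging in again, you can confirm that the new limit is active with the shell's ulimit builtin. The following is a minimal sketch; the function name has_nofile_headroom is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the soft per-process limit on open files
# is at least NEEDED.
# Usage: has_nofile_headroom NEEDED
has_nofile_headroom() {
    hn_soft=$(ulimit -n)              # soft limit of the current shell
    if [ "$hn_soft" = "unlimited" ]; then
        return 0
    fi
    [ "$hn_soft" -ge "$1" ]
}

# Example for the "www" user above, run from that user's session:
#   has_nofile_headroom 65536 || echo "open files limit is still too low"
```

Run it as the user who will mount ChironFS, because the limits are set per user and are inherited by the processes started from that session.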

Attention! Check the documentation of the software you intend to run on ChironFS. For example, if your software's documentation says that it needs 30000 open files and tells you to change the configuration in /etc/security/limits.conf, and the user who will mount ChironFS is an unprivileged user (not root), you have to configure the limits of this user too, using the formula: 2 * NF * NR + 6 (2 times number of files times number of replicas plus 6).

For example, to be able to open 30000 files replicated over 3 replicas, you have to set the limit in /etc/security/limits.conf to 180006 (2 * 30000 * 3 + 6). Then your /etc/security/limits.conf, for the user "someuser", for example, should look like this:

someuser soft nofile 180006
someuser hard nofile 180006
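The formula is easy to script if you need it for several users or replica counts. This is a small sketch; nofile_limit and limits_conf_lines are hypothetical names, and the output is only printed so that you can paste it into /etc/security/limits.conf yourself.

```shell
#!/bin/sh
# nofile_limit NF NR: apply the formula 2 * NF * NR + 6
nofile_limit() {
    echo $(( 2 * $1 * $2 + 6 ))
}

# limits_conf_lines USER NF NR: print the soft and hard nofile lines
limits_conf_lines() {
    ll_limit=$(nofile_limit "$2" "$3")
    printf '%s soft nofile %s\n' "$1" "$ll_limit"
    printf '%s hard nofile %s\n' "$1" "$ll_limit"
}

limits_conf_lines someuser 30000 3
# prints:
#   someuser soft nofile 180006
#   someuser hard nofile 180006
```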

So, that's all you need to make it run properly. Now if you are curious about why all this math is needed, the reasons are:




Chapter 7. Support



As we are just starting to publish this system, many things still have to be done. This software is provided AS IS, so mail me (put "[CHIRONFS]" in the subject) and I will TRY to do my best to support you. There is now also a discussion list, so users have this way to solve their problems too.