Volumes
Notes on partitioning and volume management for protected evolution responding to technology changes and deployment requirements.
Architectural Issue
Initially "One Laptop Per School" (OLPS) would just have a standard XO 1.5 (probably 4 GiB flash, 512 MiB RAM) plus a DVD writer supporting rewritable 4.5GB discs. The writer is manually turned on before and off after each use, since it may drain 20W or so, compared with the much lower draw of the XO. The OLPS receives DVDs from upstream (and neighbours) and copies material downstream to PodCastPlayer users via exchange of SD/MMC flash cards. It also exchanges material with Android cell phones using USB or mini or micro SD cards.
School has a library of DVDs and individual teachers have their own separate collections. Only one teacher uses the laptop at a time, with rostering to take it home overnight and at weekends. Conceivably there are additional users with just keypads and audio for UI.
Depending on scheduling of future upgrade to One Laptop Per Teacher, content volumes and power availability and funds, OLPS may optionally have either USB hard drive or USB flash drive (including CardPodCast player) or larger SD/MMC flash card attached or combinations and multiples. Such options would greatly reduce DVD shuffling and perhaps power consumption by maintaining indexes and copying data required by each user. There may also be other local users. OLPS may also have cell phone attachment with offpeak and/or peak access to internet and/or local wireless link.
Potentially also printer.
At least 5 storage media types are relevant: built in flash, USB flash, DVD writeable, DVD rewritable, hard disk. Involves different file system types and complex volume mounting and archiving and backup issues.
There will also be DTN gateways with connections to internet, including larger systems at city datacentres, and smaller ones, perhaps at area post offices in rural market centers. These would not be constrained by power costs and could have substantial hard disk storage and RAM. There could also be power and RAM constrained hubs colocated with existing deployments (including Windows) in order to share internet access.
Design of partitioning, RAID configuration, volume management and file systems is influenced by rapidly evolving technology for both online drives and offline interchange and backup media, and by unpredictable deployment requirements. This affects Storage Area Networks at central data centres and gateways or hubs handling significant volumes of delay tolerant store and forward traffic, with data stores and caches for other services. It may even affect leaf nodes with evolving physical media for backup and interchange and diverse colocation shared storage environments. Also consider requirements for encryption and database tablespaces.
Proposed Resolution
Plan all drive and other media formats and partitioning in the light of expected future evolution of interchange and backup media. Carefully design volume management and RAID configuration to accommodate fully automated re-arrangements as the drives and spindles available and the types of media change. Include capabilities for adding alternative operating systems and compatibility with virtualization.
RAID
It seems unlikely that this will be relevant at leaf nodes. Any extra spindles should be used as offline backups stored separately for safety.
Highly relevant at data centres but no point considering as part of development since very few data centres and they will decide their own requirements.
Possibly relevant at gateways to DTN internet trunks. Defer until it actually needs to be considered, then see Linux RAID and also consider hardware RAID and virtualization if co-located with an existing Windows or other deployment.
An interesting possible application is using smaller (cheaper) flash drives with pigeon post. RAID 5 would easily enable 3 pigeons with capacity C each to reliably carry 2*C if only 1 disappears (and to receive the data when the second pigeon arrives without waiting for the last to arrive). More complex RAID 6, using the mathematics of Galois fields and Reed-Solomon codes, can also be easily implemented as standard Linux software.
This would enable N+2 pigeons to reliably deliver N*C. Possibly useful if using 4 to 254 pigeons, e.g. only half of 4 pigeons need to arrive. See also the architecture document for Cleversafe dispersed storage for available block device software with extended use of Reed-Solomon beyond RAID 6.
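The 3-pigeon case can be sketched with plain XOR parity. This is only an illustration of the idea under simplifying assumptions: it uses a single stripe with a dedicated parity share (real RAID 5 rotates parity across stripes), and the function names are hypothetical, not part of any RAID tool.

```python
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_shares(payload: bytes) -> List[bytes]:
    """Split payload into two halves plus one XOR parity share (one per pigeon)."""
    if len(payload) % 2:
        # Pad to an even length; a real format would record the true length.
        payload += b"\x00"
    half = len(payload) // 2
    first, second = payload[:half], payload[half:]
    return [first, second, xor_bytes(first, second)]


def recover(shares: List[Optional[bytes]]) -> bytes:
    """Rebuild the (padded) payload when at most one share is missing (None)."""
    if sum(s is None for s in shares) > 1:
        raise ValueError("need at least two of the three shares")
    first, second, parity = shares
    if first is None:
        first = xor_bytes(second, parity)
    elif second is None:
        second = xor_bytes(first, parity)
    return first + second
```

Each share is half the payload size, so three pigeons of capacity C carry 2*C, and the payload is complete as soon as any two have arrived.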
Given Delay and Disruption Tolerant Networking (DTN) it probably makes more sense to just replicate the bundle over N successive periods in a window before acknowledgements are expected, with a corresponding single pigeon in each period. But if the window before acknowledgements are expected is more than 2 periods it might be a neat future enhancement to use RAID. That situation could arise in downlinks, e.g. from a Rural Market Centre to a school. For the corresponding uplink it could often be simpler to trigger a repetition by a broadcast NAK via radio, or by omission of a broadcast ACK.
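To compare plain replication with the parity schemes, a rough calculation helps. The sketch below assumes each pigeon arrives independently with the same probability p, which is an illustrative assumption and not a measured figure.

```python
from math import comb


def p_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n independent pigeons arrive."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def compare(p: float) -> dict:
    """Delivery probability and payload (in units of C) for three options."""
    return {
        "replicate x3 (1*C)": p_at_least(1, 3, p),  # any 1 of 3 copies
        "RAID 5, 3 pigeons (2*C)": p_at_least(2, 3, p),  # any 2 of 3 shares
        "RAID 6, 4 pigeons (2*C)": p_at_least(2, 4, p),  # any 2 of 4 shares
    }
```

Replication is always the most likely to deliver something but carries the least data per flight. The parity schemes trade some delivery probability for capacity.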
Linux Volume Management
Notes below are completely unverified thoughts while reviewing documentation of LVM.
Use of LVM is now standard with Fedora and should not be an issue. An important use case could be upgrading an HDD: replace it with a larger capacity one, then keep the old smaller one offline for quick full backups of the volatile volumes of the larger spindle. Less volatile volumes cannot fit in full backups to the old smaller spindle but can have incremental backups to optical media.
Also highly relevant for developer PCs trying out different stuff. Still need to think through primary extended partitions for other OS.
Also perhaps for portable drives (including flash) used for installing and configuring colocated gateways.
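For the drive upgrade case above, the usual LVM sequence can be sketched as a command plan. The volume group and device names are hypothetical placeholders, and the function only builds the command list rather than running anything, so it can be reviewed first. Like the rest of these notes, it is unverified against a real system.

```python
from typing import List


def migration_plan(vg: str, old_pv: str, new_pv: str) -> List[List[str]]:
    """Commands to move a volume group from an old spindle to a new one."""
    return [
        ["pvcreate", new_pv],        # initialise the new partition as a physical volume
        ["vgextend", vg, new_pv],    # add it to the existing volume group
        ["pvmove", old_pv, new_pv],  # migrate all extents off the old spindle
        ["vgreduce", vg, old_pv],    # release the old spindle from the group
    ]


if __name__ == "__main__":
    # Hypothetical names: volume group "vg_dev", old /dev/sdb2, new /dev/sdc1.
    for cmd in migration_plan("vg_dev", "/dev/sdb2", "/dev/sdc1"):
        print(" ".join(cmd))
```

Once released, the old spindle can be repartitioned as the offline target for full backups of the volatile volumes.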
Some apparently useful suggestions are in A problem of expectations, not of Linux volume Management. See also the link to the parent post on the problem with adding other operating systems and follow the many links from there.
Start with LVM Howto. Review both of them to fully understand concepts (without following details of actual usage). Accept that deployment will involve both complex RAID configurations for performance and redundancy and dynamic availability of drives and partitions in different sizes that can only be managed with LVM.
In these notes a spindle refers to a single physical storage device or a hardware RAID set of such devices. Partition refers to the PC standard of primary, bootable and extended contiguous sets of logical blocks on a single spindle compatible with fdisk. Slice refers to similar BSD style sets of cylinders within a partition. As used in the above documentation, drive refers to a partition, not a spindle.
GNU GRUB is used to boot from any bootable media (including anything on the USB bus) and bring up any virtualization system. Consider GRUB 2 when released (supposed to be imminent). This includes support for EFI, which may be useful on a multiboot OLPC XO as well as Macintosh and some other platforms. It also has more internationalization and complex scripting, which could make it easier to design a failsafe spare configuration that comes up and prompts the operator through recovery procedures when pretty well everything has failed disastrously. It enables use of ext4 in /boot.
GParted is used for partitioning. Pre-installed Windows laptops (and desktops) may contain special recovery partitions and may make use of extended partitions. There may be a requirement to recover data from them. It is unclear whether there are problems with support for extended boot records or LVM and dynamic changes to sizes of all interesting file systems.
Developer Drive
Following notes are on partitioning and volumes of an external USB laptop drive for convenient use with a laptop ignoring internal hard drive.
1. Install from Fedora 10 with no internal hard drive, no network etc. (Later develop separate live USB that boots on any plausible PC and mounts these file systems).
2. Reserve first primary partition as 9 GiB FAT32 for interchange compatibility. Install GRUB to chain load? Mac OS X EFI?
3. Reserve second primary as 32 GiB NTFS for Vista. Assume Vista and Mac OS X can exchange with Linux using data from any of FAT32, NTFS and ext3. Worry about virtualization and Mac OS X later.
4. Reserve third primary as a 10 GiB to 64 GiB place to keep stuff for restoration and/or a full image of a 64 GiB USB?
5. All other OS and swap on fourth partition as extended partition. Does this work OK with GParted and Anaconda etc? Worry about FreeBSD et al later.
6. Perhaps simplest to reserve 2GB for /boot at the end and chop up rest into say 8 GiB chunks which can be combined however desired or released to Windows etc when actually needed?
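The arithmetic of the layout above can be checked with a small sketch. The total drive size is not given in these notes, so the 320 GiB in the usage note is purely an assumed example. The 2GB /boot is treated as 2 GiB for simplicity, and the function name is hypothetical.

```python
def plan_developer_drive(total_gib: int, restore_gib: int = 64,
                         chunk_gib: int = 8, boot_gib: int = 2) -> dict:
    """Size the extended partition and count the 8 GiB chunks it can hold."""
    fat_gib, ntfs_gib = 9, 32  # first and second primary partitions
    extended_gib = total_gib - fat_gib - ntfs_gib - restore_gib
    if extended_gib < boot_gib + chunk_gib:
        raise ValueError("drive too small for this layout")
    # /boot sits at the end; the rest is cut into equal chunks.
    chunks, spare_gib = divmod(extended_gib - boot_gib, chunk_gib)
    return {
        "primaries_gib": [fat_gib, ntfs_gib, restore_gib],
        "extended_gib": extended_gib,
        "chunks": chunks,
        "spare_gib": spare_gib,
        "boot_gib": boot_gib,
    }
```

For an assumed 320 GiB drive with a 64 GiB restore partition this leaves a 215 GiB extended partition: 26 chunks of 8 GiB, 5 GiB spare and 2 GiB for /boot.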
Developer Flash
Worth doing for eat your own dogfood reasons.
1. Assume 16 GiB for reasonable economy at the moment.
2. Reserve first primary partition as 4.5 GiB FAT32 for interchange compatibility. Also available for swap to flash (hibernate saves complete RAM image in swap space so would need 2 x 2GiB for 2GiB of RAM?). Install GRUB to chain load?
3. Work out appropriate stuff on remaining 11.5 GiB for comprehensive Fedora 10 livecd that can boot anywhere and store multiple persistent homes to same flash or other flash.
4. Meanwhile, to get started, just check out default Fedora 10 layout for 11.5 GiB (or adjust from default for 16 GiB).
5. Consider using a separate small fast flash drive for swap since it is disposable. Could also be a special boot device. See also slashdot.
6. Mount noatime etc to avoid flash wear. See more detailed advice for Ubuntu EeePC. There should be a similar page for Fedora EeeDora but it seems to be falling behind. Recommends non-journaling ext2 despite unreliability? Better tweak guidance (and some cost info) here.
7. A major requirement will be to spin up and mount both DVD and HDD only when actually in use and to remove the power drain completely as soon as they are not in use. Separate issues will be tuning any timeouts and specifications for hardware costs of standby power switching off and spinup/spindown and reading, writing, seeking, streaming and standby power drain. May need finer grained separation of read only and volatile volumes than is implicit in /var etc. Consider bizarre schemes for some sort of snapshot variation that caches actually read blocks in flash. Sticky bits?
8. Scenario could be similar to "sometimes connected" Coda caches. Gather files needed to flash from HDD or DVD in advance.
9. Likewise Reliable Transport Server should sync bundle fragments to flash initially and move them to normally offline storage in batches for later batch processing. Consistent with memory constraints and overall Flow Composition architecture.
10. Check out whether Scientific Linux liveusb provides more sophisticated overlays than XO or EeePC. Critical and early requirement is for rollback of updates. Full version control for configuration. Presumably more than one layer of overlays for software volumes. Earlier known good system always recoverable - update previous and use current. Combine with offline restoration/installation from DVD or flash.
11. Unclear how to find out what wear levelling an SD card or USB flash actually uses and whether jffs2, LogFS etc are still relevant. Assume XO internal flash already uses optimal system (or will shift to UBIFS when appropriate - see LWN review). Presumably still should not be using file systems optimized for seeking on rotating cylinders on flash devices. UDF might be more relevant.
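The idea in item 7 of caching actually read blocks in flash can be illustrated with a read-through cache. This is a toy sketch only: an in-memory dictionary stands in for the flash store, `read_slow` is a hypothetical callable standing in for a spun-up HDD or DVD read, and eviction is simple least-recently-used.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Serve repeated block reads from a fast store; hit the slow device on a miss."""

    def __init__(self, read_slow: Callable[[int], bytes], capacity_blocks: int):
        self._read_slow = read_slow
        self._capacity = capacity_blocks
        self._cache = OrderedDict()  # block number -> data, oldest first
        self.slow_reads = 0          # how often the slow device was needed

    def read(self, block_no: int) -> bytes:
        if block_no in self._cache:
            self._cache.move_to_end(block_no)  # mark as recently used
            return self._cache[block_no]
        data = self._read_slow(block_no)       # would require spin-up
        self.slow_reads += 1
        self._cache[block_no] = data
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)    # evict least recently used
        return data
```

The `slow_reads` counter gives a crude measure of how often the drive would have to be powered up for a given access pattern and cache size.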
File Systems
Reiser4 seems superior for some purposes and ext4 is better than ext3 and supported by Fedora 10 Anaconda if you add a boot option to the command line. GRUB still requires ext3 for /boot. Differences don't seem important yet.
See phoronix review.
Btrfs or ZFS may end up being the way to go for data centres rather than Reiser4 or ext4.
Experiment with these and various others (and with UDF spared) when actually relevant.
Stick to default ext3 for now as it is not clear when all tools will work smoothly with either, e.g. GParted and Clonezilla. Assume Fedora 11 will smoothly upgrade to ext4 and can consider Btrfs or Reiser4 again then.
OpenSolaris and Nexenta OS and the ZFS file system look especially interesting for support of flexible, scalable and high performance "appliances" including Network Attached Storage (NAS).
See also Lustre, Coda, GFS, AFS for DFS.
There's a plausible argument that FAT32 is actually optimal for a lot of the sort of batch manipulation we would be doing - especially for constrained memory.
Consider simple object store mapping SHA1 hashes of content to 8.3 folder and file names. Also consider UDF. Check out Tiers store use of LDAP for metadata.
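One possible mapping from SHA1 hashes to 8.3 names, as a sketch rather than a settled design: Base32-encode the 20-byte digest to 32 characters (A-Z and 2-7, all legal in FAT short names) and cut it into three 8.3 components. The split into two directory levels plus a file name, and the function names, are assumptions made for illustration.

```python
import base64
import hashlib


def digest_to_path(digest: bytes) -> str:
    """Map a 20-byte SHA1 digest to DIR/DIR/FILE, each component in 8.3 form."""
    if len(digest) != 20:
        raise ValueError("expected a 20-byte SHA1 digest")
    name = base64.b32encode(digest).decode("ascii")  # exactly 32 chars, no padding
    pieces = [name[0:11], name[11:22], name[22:32]]  # 11 + 11 + 10 characters
    return "/".join(p[:8] + "." + p[8:] for p in pieces)


def content_path(content: bytes) -> str:
    """Path under which a blob with this content would be stored."""
    return digest_to_path(hashlib.sha1(content).digest())


def path_to_digest(path: str) -> bytes:
    """Inverse mapping: recover the SHA1 digest from a stored path."""
    name = path.replace("/", "").replace(".", "")
    return base64.b32decode(name)
```

Because the mapping is reversible, a file's path can be checked against its content hash after copying between media, with no separate index needed.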
Vista
See how to resize your Vista partition and Vista recovery disk download.
For Windows Vista on Intel Mac OS X 10.4 Leopard, the Apple Boot Camp FAQ recommends 15GB unused by HFS.
Database
Mostly we would have low levels of concurrency and be using SQLite via SQLAlchemy (or perhaps via Hibernate on Android).
However there could be significant concurrency at larger internet connected hubs (including major datacentres but perhaps also others). These would also have requirements for 24/7 operations to exploit offpeak trunk capacity at night and replication to optical media during working hours. PostgreSQL is the only plausible choice for database management.
Fully automated database operations including backups, archiving and space recovery would be essential. Also "hot" or at least minimal downtime for addition and replacement of drives.
Full clustering and virtualization would be used in a grid or cloud computing environment at datacentres.
But even smaller hubs could have complex database operations issues. Especially if sharing DVD or storage drives or processors colocated on computers used for other purposes. Much easier if only sharing internet connectivity. They could also be involved in replication of databases to other hubs.
Many issues should be simplified by using LVM "snapshots" to periodically briefly synchronize a database to a consistent state and then immediately resume operations with no significant interruption. The snapshot can then be dumped for replication to archives and other sites and reloaded to a local replica which is then resynchronized with original that has continued live updates during the offline operations.
This may also be much faster than "vacuuming" for reclaiming lost space.
Slightly more than twice the minimum space is required during each periodic resynchronization. This might perhaps be used as RAID 1 after database resynchronization, with the volume split only after dump to offline media and then rejoined when the old version has been discarded.
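The snapshot-and-dump part of this cycle can be sketched as a command plan. All names (volume group, logical volume, mount point, archive path, snapshot size) are hypothetical. The step that brings the database to a consistent state before the snapshot is deliberately left out because it depends on the database. The reload and resynchronization of a replica are not covered. The function only builds the commands and does not run them.

```python
from typing import List


def snapshot_dump_plan(vg: str, lv: str, snap: str, size: str,
                       mount_point: str, archive: str) -> List[List[str]]:
    """Commands to snapshot a volume, archive the frozen copy, then drop it."""
    origin = "/dev/%s/%s" % (vg, lv)
    snap_dev = "/dev/%s/%s" % (vg, snap)
    return [
        # Freeze a point-in-time view; live updates continue on the origin.
        ["lvcreate", "--snapshot", "--size", size, "--name", snap, origin],
        ["mount", "-o", "ro", snap_dev, mount_point],
        # Dump the frozen view for replication to archives and other sites.
        ["tar", "-czf", archive, "-C", mount_point, "."],
        ["umount", mount_point],
        # Discard the snapshot so its copy-on-write space is released.
        ["lvremove", "-f", snap_dev],
    ]
```

The snapshot size only needs to cover blocks changed while the dump runs, which is one reason the overall space needed stays near twice the minimum rather than growing further.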
To facilitate this a scalable EDOC architecture should be designed using the Flow Composition Model UML profile. This is consistent with both future grid/cloud operations and overall DTN requirements. Batches are "published" to flow through queues to "subscribers". UML2 activity and interaction overview diagrams should be a major focus of both specification and design modeling.
Disk Quotas
Leaf nodes combining DTN with normal use of XO are inherently multi-user systems despite having only one direct human user at a time. In addition they could be used for simultaneous duplication or other activities by additional simultaneous operators that have separate keypads and only audio output.
Contention for CPU and RAM resources could be resolved by simply giving the main human operator total priority and only activating other processes when resources are available. Virtualization may consume unacceptable amounts of RAM but could be considered for XO 1.5 with 512MiB.
The standard disk quota system may be unacceptable for use with flash drives and may also be an unnecessary overhead. Hard disk quotas could be provided by preallocating separate volumes.
Perhaps soft quotas could be provided by a separate process checking space available and in use periodically and reallocating volume sizes from reserved space as required (while also triggering alerts and changes in parameters to get back within quota).
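The decision such a periodic checker would make can be sketched as a pure function. Sizes are in MiB, and the 128 MiB growth step is just an assumed default. The function and field names are hypothetical. Actually resizing a volume and raising the alert are left to whatever mechanism is eventually chosen.

```python
def quota_action(used: int, allocated: int, soft_limit: int,
                 hard_limit: int, step: int = 128) -> dict:
    """Decide whether to alert and how much to grow one user's volume (MiB)."""
    headroom = allocated - used
    grow = 0
    # Grow from the reserve when nearly full, but never past the hard limit,
    # which is enforced by the maximum size the volume may reach.
    if headroom < step and allocated < hard_limit:
        grow = min(step, hard_limit - allocated)
    return {
        "alert": used >= soft_limit,  # over the soft quota: prompt the operator
        "grow_mib": grow,
    }
```

A checker would call this at each interval with figures from the file system and apply `grow_mib` from reserved space while the alert prompts the user to get back within quota.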
Consider possibilities of some weird combination of mounting volumes on sparse files, snapshotting and using ext4 preallocation of contiguous 128MiB extents. Somehow or other pre-allocate reserves to trigger action when soft limit reached with temporary action being to simply shrink the preallocation reserve.
Conceivably could also handle out of space situations by paging a 128MiB fragment to offline media while recovering.
Main point is that whatever is done has to be done automatically or by simple prompts for operator to insert and remove media without any possibility of remote sysadmin assistance.

