Message boards : Number crunching : ninirosetta_database slot folder
Author | Message |
---|---|
Professor Ray Joined: 7 Dec 05 Posts: 35 Credit: 528,961 RAC: 185 |
In the slot 2 folder of my currently executing WU is a subfolder - minirosetta_database - that comprises 309MB (316 w/ slack). This guy is a little bit of a problem. I lay out the HDD in three sections according to Pareto's Rule: 80% of the time, only 20% of the files are actually used. The outer tracks are reserved for high-performance files - the most frequently used - and are sorted by last modification time, descending. The middle section is dedicated to free space and the NTFS metafiles, e.g., $MFT, the $MFT reserved space, $UsnJrnl, etc. The NTFS metafiles and the page file are contiguous and bisect the free space. The inner tracks hold the least used files, sorted by last access date, descending. With this scheme, modified files have near-immediate access to free space, and file fragmentation is constrained to less than 15% of the disk area. Moreover, files that do not meet the high-performance criterion - most frequently used - are laid on the very outer tracks of the inner 'archive' section according to date of last access. In general this prevents churning of data on the HDD during defrag; files are consolidated fairly near to where they exist in fragmented form. For example, a previous AV def file becomes the 'repair' version if the AV def update fails; the 'live' AV def lives in the very outer tracks - to facilitate boot time - and the repair copy lives in the outer tracks of the least-used 'archive' section. In theory fine and dandy, and in general it works fairly well. The problem with the scheme is that the minirosetta_database folder gets laid out 1/3 of the way into the least-used portion of the drive. UD3 puts it there based on use, i.e., access or modification. What ends up happening, because the folder is transient from WU to WU, is that the inner 80% of the data constantly gets thrashed from defrag to defrag, depending on whether a Rosetta WU is being crunched or not. What good is that folder if it's not being used? |
David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
"In the slot 2 folder of my currently executing WU is a subfolder - minirosetta_database - that comprises 309MB (316 w/ slack). This guy is a little bit of a problem." That folder does get used. The amount of use depends on the type of Rosetta job being run. For the majority of tasks it reads important database files at the start only. Many of the files do not get used most of the time but are included because a researcher may use them for a particular task. The Rosetta database contains score function, chemical type, rotamer data, etc., which is the culmination of many years of research and development by many different institutions and researchers. |
Professor Ray Joined: 7 Dec 05 Posts: 35 Credit: 528,961 RAC: 185 |
I drilled into the matter a bit and noticed the comprehensive content of the database folder. Moreover, I discerned that content within that folder was indeed being sorted into the outer - high performance - tracks. I don't believe that it's using a significant amount of the folder content, however. I think what I'm going to end up doing is altering the defrag method based on the presence of a Rosetta WU. Next time it's not present on the HDD, I'll do a strict sort defrag as aforementioned; everything will be defragged and compacted to consolidate free space in a strict sort order. After that, however, whenever it shows up again, I'll run a generic consolidate defrag. That one doesn't implement the strict sort as described, but throws files into either section on a first-come, first-served basis with the ultimate aim of purely consolidating free space. Over time that becomes increasingly inefficient, as the method throws files into holes wherever it can. That notwithstanding, the minirosetta_database contents will get sorted into the very outer tracks of the inner - archive - section, and the database files attributed as high performance will get laid on the inner tracks of the high performance section. That's because the stuff already there won't get moved. Very much akin to shaking a jar of various sized grains. That way only the innermost tracks of the high performance section, e.g., all BOINC slot folders containing WUs, checkpoints and result files, and the very outer tracks of the inner archive section - containing, e.g., BOINC project executable folders - will be affected by defrag. I think this is a design flaw; the project should extract the minirosetta_database and blow away what it doesn't need for the specific WU it's crunching. Space is being wasted by redundant - and unused - data, i.e., that which is archived in the BOINC_Data\projects\boinc.bakerlab.org_rosetta zip file(s). 
I fully comprehend why that monstrosity lives there: so 300MB of data doesn't need to be downloaded with each WU. It just seems an inefficient use of client resources though. |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
"Space is being wasted by redundant - and unused - data, i.e., that which is archived in the BOINC_Data\projects\boinc.bakerlab.org_rosetta zip file(s). I fully comprehend why that monstrosity lives there: so 300MB of data doesn't need to be downloaded with each WU. It just seems an inefficient use of client resources though." The database is only downloaded once each time the project app is updated, and in zipped form, so it's much less than that. 300MB of temp files should also not be an issue today. Besides that, I think there's no need to defrag the drive each time a new WU starts; just let the files be where Windows has put them, they will be deleted when the WU finishes anyway. |
Professor Ray Joined: 7 Dec 05 Posts: 35 Credit: 528,961 RAC: 185 |
What I've done is create a junction, i.e., a reparse point, for the //projects/boinc.bakerlab.org_rosetta folder, with the target on another disk. That offloads the 207MB minirosetta_database_3d2618f.zip file. I've extracted the contents of that minirosetta DB ZIP to a separate folder, i.e., minirosetta_database, on the same drive. That offloads another 317MB for the transient DB folder, netting a total of 1/2GB of space saved on %SYSTEMDRIVE%. Is it possible to replace the existing ZIP file with an SFX - implementing a pre-extraction batch / script - or perhaps, ideally, the script itself? The script - either CMD or BAT - would automagically create a junction - reparse point - in the //SLOTS working directory for the current WU, with its target being this once-for-all-time extracted copy of the minirosetta_database ZIP. Right now it's tedious in that such needs to be accomplished manually. |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
"Right now it's tedious in that such needs to be accomplished manually." I wouldn't say this needs to be done. Honestly, pretty much nobody cares about that ~300MB per WU and some drive fragmentation; most people let Windows defrag the HDD automatically in the background. I defrag my drives maybe once a year (auto-defrag off), if I have the impression it's time to do it. If you need top performance, better get an SSD: no need to defrag ever or spend time thinking about how to move Rosetta's files around. ;-) It would indeed be nice, however, if the database could be extracted only once after download, not because of drive fragmentation or space (we're not in the 90's anymore), but because the extraction takes time which could be used for crunching (and because it's sometimes a bit annoying on systems with slower drives). |
Chilean Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
"Right now it's tedious in that such needs to be accomplished manually." I have a 500GB SSD, but I moved the BOINC data files to my HDD since Rosetta uses the disk a lot (less than vLHC and ATLAS tho) and very frequently. It does slow down the process of loading 8 WUs at the same time (like when I reboot), but I almost never shut down. Also, Professor Ray is running a PIII! My first PC (when I was a kid, about 15 years ago or so) had a 450MHz PIII with an impressive 32MB of RAM. My first "own" laptop had the very fast Tualatin PIII... ahh good times. Now my cellphone is faster than both lol. With this in mind, maybe OP has space problems, thus this thread. |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
"With this in mind, maybe OP has space problems, thus this thread." Space problems are easily solved with an additional drive (which he has, as far as I can tell); all the other stuff he's doing is completely unnecessary. The only thing that makes sense for speed improvement is moving the slots dir to a different drive than the rest of BOINC; then, when starting a new WU, the database is extracted from one drive to another. (Or move just the R@H project folder, same effect.) |
Professor Ray Joined: 7 Dec 05 Posts: 35 Credit: 528,961 RAC: 185 |
FWIW, my Win2003 R2 system runs on a TUV4x PIII-1400S w/ 3x 512MB SDRAM and an Adpu160m dual-channel SCSI host controller w/ 3x Fujitsu MAX3073NC 15k HDDs. %SYSTEMDRIVE% capacity = 11.7GB w/ 1.21GB free and a 1.5GB fixed-size page file, with the NTFS metafile layout optimized such that directories immediately precede $MFT, followed by the custom-defined 'space reserved for $MFT', followed by $LogFile, $Bitmap and $MFTMirr. The other NTFS metafiles, i.e., $Secure, $Reparse, $UpCase, $AttrDef and $UsnJrnl, are placed on the tracks outer to, and immediately adjacent to, the track holding the $MFT. Moreover, $UsnJrnl has been custom-sized to minimize fragmentation, in that the $UsnJrnl file is locked during Windows Protected Mode and therefore can only be defragged at boot time, before Windows Protected Mode starts. Furthermore, the default sizes of both $UsnJrnl and the 'space reserved for $MFT' are stupidly huge; the latter amounts to 25% of HDD - or partition - capacity and is placed by default at the outer 60% tracks. On my 11.7GB capacity / 1.2GB free drive, the $MFT is 62MB with 34MB of 'space reserved for $MFT'. The $UsnJrnl metafile by default is also much larger than necessary. It is a FIFO-type sparse file, i.e., when NTFS journaling entries exceed the max size, it truncates the oldest journal entries and scatters unmovable blocks throughout the drive. My custom-sized $UsnJrnl amounts to 5 blocks. A default-sized $UsnJrnl would scatter hundreds of unmovable blocks around the drive. The drive layout is further optimized so that the most used 33% of the files are organized onto the very outer tracks, with the remainder sorted to the very inner tracks. The outer tracks are sorted by last modification date, descending, and the inner tracks by last access date, descending. As such the BOINC projects folders are located in the outer tracks of the inner section, and all the BOINC project slots are on the inner tracks of the outer section. 
Each is immediately adjacent to the large segment of free space, which itself is bisected by the $MFT and the page file. Defragging is quick, with minimal thrashing of the HDD, in that all fragmentation occurs within one track of all available free space; 70% of the files are never touched in that they are rarely accessed or modified. The thing is that the MiniRosetta Database ZIP file lives in the Rosetta Project folder, which in itself isn't a problem. But the contents of this ZIP get extracted to the Rosetta slots folder, and since those contents are never modified, they get sorted into the very inner tracks during defrag. Now 60% of the HDD gets thrashed during defrag. I fixed the first part of the problem by putting the Rosetta Project folder onto an outer partition of a second drive and creating a reparse point - symbolic link - to it in the BOINC Projects folder. BOINC is fat, dumb and happy, and runs as if the Rosetta Project folder lives in the %SYSTEMDRIVE% BOINC installation. I fix the second aspect of the problem by creating a junction point for the MiniRosetta Database folder in the Rosetta Project slot folder; it also points to a folder living on the outer partition of the other drive - which incidentally hosts another 1.5GB page file - the two page files comprising one striped swap file. But dealing with that is tedious in that I have to manually extract the contents of the ZIP, pick it as link source, drop it as a junction in BOINC/Projects/slots, delete the existing miniRosetta_database, rename the junction point, and then empty the trash bin. All of that could be nicely accomplished with a script contained in the minirosetta_database.ZIP if it could be an SFX, i.e., a self-extracting executable. Then I wouldn't have to do diddly squat, as all the reparse - junction - points would be generated by script. One obvious solution would be to migrate the BOINC_Data folder to the alternate HDD where the Rosetta Project folder specifically has been migrated to. 
However, that HDD hosts a second O/S for my dual-boot system and has its own set of mitigating circumstances akin to the aforementioned, and therefore would be less than ideal; it serves just this specific use, i.e., hosting the Rosetta Project folder and the extracted contents of the minirosetta_database.ZIP. BTW, all this talk about reparse / junction points is particular to WinXP technology. In Vista, Win7 and Win8, the functionality exists for outright 'symbolic links', where specific files can be linked across HDDs (WinXP technology only allows this for folders). The sad fact of the matter is that, given my personal financial constraints, the hardware platform and O/S currently implemented are what they are and won't change in the foreseeable future. |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
"The thing is that the MiniRosetta Database ZIP file lives in the Rosetta Project folder, which in itself isn't a problem. But the contents of this ZIP get extracted to the Rosetta slots folder, and since those contents are never modified, they get sorted into the very inner tracks during defrag. Now 60% of the HDD gets thrashed during defrag." Then stop constantly defragging your drive; this is completely unnecessary (same applies to the rest of your micro-management). Problem solved. "The sad fact of the matter is that, given my personal financial constraints, the hardware platform and O/S currently implemented are what they are and won't change in the foreseeable future." Have you considered getting an old IDE drive from eBay? In Germany I can buy such a drive (or even a few of them) for 1€ + shipping. There you could put all BOINC data and would never need to defrag it, just let BOINC work on it. |
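
The manual routine Professor Ray describes (extract the database ZIP once to a folder on another drive, then put a junction to it in the slot folder in place of minirosetta_database) could be sketched roughly as below. This is only an illustration under stated assumptions, not something BOINC or Rosetta@home provides. The function names `extract_once`, `junction_command` and `link_slot_database`, the `.extracted_from` marker file and the `.local_copy` suffix are all hypothetical. `mklink /J` is a cmd.exe built-in on Vista and later only, so on WinXP/Win2003 a separate junction tool would have to be substituted. Whether the BOINC client tolerates the junction is not established in this thread beyond Professor Ray's own report.

```python
import os
import subprocess
import zipfile
from pathlib import Path
from typing import List


def extract_once(zip_path: Path, shared_dir: Path) -> bool:
    """Extract the database ZIP into shared_dir unless this ZIP was already extracted.

    Returns True if an extraction happened, False if it was skipped.
    """
    # Hypothetical marker file recording which ZIP the folder came from.
    marker = shared_dir / ".extracted_from"
    if marker.exists() and marker.read_text() == zip_path.name:
        return False
    shared_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(shared_dir)
    marker.write_text(zip_path.name)
    return True


def junction_command(link: Path, target: Path) -> List[str]:
    """Build the command line that would create a junction at link pointing to target.

    Assumption: `mklink /J`, which exists on Vista and later only.
    """
    return ["cmd", "/c", "mklink", "/J", str(link), str(target)]


def link_slot_database(slot_dir: Path, shared_db: Path) -> None:
    """Put a junction named minirosetta_database into slot_dir (Windows only)."""
    if os.name != "nt":
        raise OSError("junctions are a Windows/NTFS feature")
    link = slot_dir / "minirosetta_database"
    if link.exists():
        # Move any existing folder aside rather than deleting it.
        link.rename(link.with_name("minirosetta_database.local_copy"))
    subprocess.run(junction_command(link, shared_db), check=True)
```

Keying the marker on the ZIP file name means a renamed database ZIP triggers a fresh extraction, assuming the name changes when the project ships a new database, as the hash-like suffix in minirosetta_database_3d2618f.zip suggests. The existing slot folder is renamed rather than deleted so that nothing is lost if the junction turns out not to work.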
©2024 University of Washington
https://www.bakerlab.org