Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12355
Status: Offline
Re: Project Status (First Post Updated)

My last ARP validation was returned at 09:51 GMT (UTC).

Mike
[Feb 11, 2025 12:42:41 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 951
Status: Offline
Re: Project Status (First Post Updated)

Some of my older tasks that were waiting for wingmen have now validated, but every task I returned today is still ending up stuck at Pending Validation when the wingman returns... Looks like there's a queue (possibly because of returns near or just after deadline that belatedly got precedence for users with multiple projects?)

I do seem to have picked up a few more tasks, but one of my systems is still dry (should have three tasks, two running) and two of the other three don't have the extra task that would complete their buffers... So instead of managing around 22 to 23 hours per allocated core per day I'm looking at 12 hours or less :-( -- maybe rapid turnaround isn't what's wanted after all and I should join the "several days of work" brigade :-)

Some insight into how WCG decides how much new ARP1 work is available on a per-day (or, better, per shorter interval) basis would be interesting, if possible :-)

Cheers - Al.
[Feb 11, 2025 1:15:26 AM]
Sgt.Joe
Ace Cruncher
USA
Joined: Jul 4, 2006
Post Count: 7659
Status: Recently Active
Re: Project Status (First Post Updated)

There does seem to be some type of problem with the MCM feed. I have one dry system and the others are operating at half capacity. This is at 03:25 UTC, so in the morning I will see if anything has been resolved.

Cheers
----------------------------------------
Sgt. Joe
*Minnesota Crunchers*
[Feb 11, 2025 3:26:36 AM]
alanb1951
Veteran Cruncher
Joined: Jan 20, 2006
Post Count: 951
Status: Offline
Re: Project Status (First Post Updated)

I wonder if the ongoing work at the data centre in preparation for commissioning all the new equipment is causing occasional inter-system problems or lost virtual machines; there are certainly some odd goings-on, whatever the reason!...

Once more, the intervals with no available work for ARP1 seem to be stretching out. Also, I note that whilst the ARP1 Project Statistics had reported a fairly typical 3821 results returned at midday, it only reported 3911 at the midnight collection! Something strange is going on if it only saw 90 more results (presumably 45 WUs, as we seem to [mostly] be on Normal generations now, with two results per WU).

As for MCM1, when there's a real [but non-crash] problem (e.g. too many retries going to other platforms, or download issues) I find one or more of my systems might end up with a free thread or two, but I'm not seeing that at present... I don't run more than 50% of threads for MCM1 on any system, so my "needs" are fairly low and it usually takes a server outage to run my systems dry! This does mean that I'm less likely to notice the ups and downs of task availability unless I am actively watching :-) -- my sympathies to Sgt. Joe and any others having similar issues.

Cheers - Al.
[Feb 11, 2025 4:05:37 AM]
Grumpy Swede
Master Cruncher
Svíþjóð
Joined: Apr 10, 2020
Post Count: 2158
Status: Offline
Re: Project Status (First Post Updated)

I'm taking a break from WCG until either the MAM Beta starts or OPNG restarts.
[Feb 11, 2025 8:21:01 AM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2154
Status: Offline
Re: Project Status (First Post Updated)

As the Statistics history of ARP1 suggests, no Results have been returned so far today as of noon -- none with Points generated, I should hastily add.

As Al remarked, I also dedicate no more than 50% of my threads to MCM1 (i.e. on my fastest systems).

Adri
[Feb 11, 2025 1:49:24 PM]
gj82854
Advanced Cruncher
Joined: Sep 26, 2022
Post Count: 102
Status: Offline
Re: Project Status (First Post Updated)

Take a look at the generations.txt file. There is definitely something wrong somewhere.
[Feb 11, 2025 3:06:22 PM]
adriverhoef
Master Cruncher
The Netherlands
Joined: Apr 3, 2009
Post Count: 2154
Status: Offline
Re: Project Status (First Post Updated)

Two tasks of mine have just been marked Valid, and I've also received 6 tasks from generations 136, 137 and 138 in the past hour and a half.

(Output generated by "wcgstats -frrr= 660217290")
workunit 660217290
ARP1_0003582_141_0  Linux Ubuntu  Valid  2025-02-05T16:13:38  2025-02-11T09:28:37  23.40/23.40  589.0/587.0
ARP1_0003582_141_1  Linux Fedora  Valid  2025-02-05T16:13:43  2025-02-05T23:07:18   6.70/6.74   585.0/587.0

workunit 660217294
ARP1_0003544_141_0  Linux Fedora  Valid  2025-02-05T16:13:43  2025-02-05T23:01:24   6.61/6.66   577.6/580.6
ARP1_0003544_141_1  Linux Ubuntu  Valid  2025-02-05T16:13:38  2025-02-11T09:35:37  23.18/23.19  583.6/580.6


Adri
[Feb 11, 2025 3:17:53 PM]
Mike.Gibson
Ace Cruncher
England
Joined: Aug 23, 2007
Post Count: 12355
Status: Offline
Re: Project Status (First Post Updated)

According to generations.txt, only 60 units were processed in the 24 hours to midday GMT (UTC).

Mike
[Feb 11, 2025 8:13:07 PM]
danwat1234
Cruncher
Joined: Apr 18, 2020
Post Count: 39
Status: Offline
Re: Project Status (First Post Updated)

Archiving the Operational Status log from https://www.cs.toronto.edu/~juris/jlab/wcg.html here:

February 5, 2025
We were able to compile and test a standalone build of the MAM project, incorporating the small group of WCG libraries that interface with the BOINC client API to handle IO, checkpointing, and graphics in the same way other WCG applications such as MCM1 do.
The MAM1 application is essentially the MCM1 code with some improvements (and new graphics). We were able to generalize the data processing steps required to produce the custom dataset format that MCM1 currently supports. We will now be able to evaluate arbitrary normalized expression datasets from the Gene Expression Omnibus (GEO) repository, and launch new projects searching for biomarkers much more easily.
Currently, we are working to deploy the MAM1 application in our QA environment to ensure that integration with BOINC, application graphics, and our scripts for generating and processing workunits locally will work as expected.
Some issues remain with the QA environment since construction at the data centre was completed. Although we were able to bring up the production system on January 10, 2025, many of our QA servers were bricked. We are working with staff at SHARCNET to fix these problems so that we can properly test the application. Once we are able to evaluate it in the QA environment, we can export a satisfactory version from the QA BOINC database, import it into the production environment, and get the beta started. We will keep you posted.
January 10, 2025
All our servers are back online. We are downloading processed work units and sending out new ones, and they have the right deadlines (after the initial glitch with the time).
We have noticed 92,194 results that ended up in an error state. We saved all these in a file so we can repair this issue. We'll be able to rescue these, especially since there is no file deleter or db purge daemon running for now, so all data related to these workunits is still on the filesystem.
We will continue to monitor the system to make sure it is stable.
We are also working on getting MAM project into beta.
Thank you all for continuing to support research. Happy new year.
January 9, 2025
BOINC database is up and in a good state. We are waiting on two more servers to regain access to the network, at which point we will be restarting the scheduler, transitioner, assimilators and validators.
All deadlines for outstanding MCM1 work units have been extended to just after 6:00 p.m. Eastern Standard Time on January 15th, 2025.
Web site is up; stats will be updated soon.
Forums are up.
January 8, 2025
Most of our infrastructure is back online. Unfortunately, some issues with the network and specific virtual machines remain. Thus, the BOINC database node remains unavailable, and the website and forums also do not function properly.
The SHARCNET data centre team is working to restore access to these instances in priority order.
Once this is resolved, we will have a smooth restart of the workunit management and BOINC components on the backend, and be able to isolate and diagnose any remaining issues as we restart.
January 7, 2025
Networking issues have been resolved. The data centre staff is finalizing our access.
January 4, 2025
Update from the data centre: "Having issues with the physical network, likely can't get it diagnosed and fixed until Monday, January 6." As a result, we still do not have a connection to our servers.
January 3, 2025
We have been notified that the core system at SHARCNET is coming online now (5pm). They are planning to complete it tonight (January 3). We are waiting for access to our systems and will start turning everything back on as soon as we can log in.
December 20, 2024
We have begun testing the Mapping Arthritis Markers (MAM1) prototype based on the current MCM1 code. We expect to launch the beta version of the application shortly after we are back online on the 3rd of January. We will provide a firm launch date once we are back online. The GPU version of the project may take some additional time to develop and test.
We have been reviewing the MCM1 application code and considering alternative upgrade paths for MCM1 and related projects to enable GPU compute for gene signature search in general. We have decided to work towards replacing the SvmLightLib dependency and other related sections of the code with more modern, but still self-contained, dependencies that will provide the option to select a CPU or GPU backend depending on user preferences and device capability. This should also enable us to take advantage of modern instruction sets and possibly other architectures in the future.
We will be able to publish a preliminary benchmark of the old vs. new application code, comparing across a selection of hardware that WCG staff have at home. We expect performance improvements and better memory utilization for the new version of MCM1 and, by extension, the initial implementation of the MAM1 project.
In light of the merging of this pull request (introducing "BUDA" to the latest BOINC server releases), we have strong motivation to upgrade our BOINC server version to track BOINC upstream: nearly every bioinformatics application we have experience running in an HPC environment could run on the grid, if only we could run containers. We have considered multiple strategies to accomplish this migration in the past, and now that we have occasion to test existing and new applications with the newest BOINC server version during this offsite downtime, we will put together a roadmap after the launch of MAM1 and upgrade the BOINC server to use BUDA/containers going forward.
December 9, 2024
We are working with the backups of the BOINC and website/forums databases offsite to prepare the Mapping Arthritis Markers project for the launch in January.
While we have the opportunity, we will look into optimizing the MCM1 and MAM1 applications. For example, we will create an encoding for the fields in the MCM1 configuration file and further compress the already tiny configuration file, rather than send it in plain text as before. This update will also be used to streamline the configuration file across individual projects.
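As an aside for the curious: the actual MCM1 configuration format and field names are not public, so the following is only a minimal sketch of the general idea (a binary field encoding plus compression instead of plain text), with invented field names.

```python
import struct
import zlib

# Hypothetical field table shared by server and client; the real MCM1
# configuration fields are not published, so these names are placeholders.
FIELDS = ["dataset", "num_genes", "signature_size", "rng_seed"]

def encode_config(cfg: dict) -> bytes:
    """Pack each field as (field_id, length, utf-8 value), then deflate."""
    payload = b""
    for field_id, name in enumerate(FIELDS):
        value = str(cfg[name]).encode("utf-8")
        payload += struct.pack("!BH", field_id, len(value)) + value
    return zlib.compress(payload, 9)

def decode_config(blob: bytes) -> dict:
    """Inverse of encode_config; values come back as strings."""
    payload = zlib.decompress(blob)
    cfg, offset = {}, 0
    while offset < len(payload):
        field_id, length = struct.unpack_from("!BH", payload, offset)
        offset += 3
        cfg[FIELDS[field_id]] = payload[offset:offset + length].decode("utf-8")
        offset += length
    return cfg

if __name__ == "__main__":
    original = {"dataset": "GSE00000", "num_genes": 20000,
                "signature_size": 10, "rng_seed": 42}
    blob = encode_config(original)
    assert decode_config(blob) == {k: str(v) for k, v in original.items()}
    print(len(blob), "bytes on the wire")
```

Even for a tiny file, dropping the field names from the wire and compressing the values keeps the per-workunit download a little smaller, and a shared field table gives every project the same parser.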
We are also exploring the potential for a GPU version of MCM1 and MAM1 projects, and future applications for searching for biomarker signatures.
We are working on tiered caching across the download servers where workunits that have been loaded into BOINC recently will trigger a write of the input files to a distributed in-memory cache, backed up by the local disk of the download servers.
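The implementation details of that cache are not public; the sketch below is only a simplified read-through rendering of the same idea (memory first, then the download server's local SSD, with NFS as the slow path), using invented paths.

```python
import os

# Invented paths; the real cache layout on the WCG download servers is not public.
LOCAL_CACHE_DIR = "/var/cache/wcg/download"   # local RAID1 SSD on the download server
NFS_DIR = "/nfs/wcg/download"                 # shared storage (the slow path)

_memory_cache: dict[str, bytes] = {}          # stand-in for a distributed in-memory cache

def fetch_input_file(name: str) -> bytes:
    """Serve a workunit input file: memory first, then local SSD, then NFS."""
    if name in _memory_cache:                            # tier 1: memory hit
        return _memory_cache[name]

    local_path = os.path.join(LOCAL_CACHE_DIR, name)
    if os.path.exists(local_path):                       # tier 2: local SSD hit
        with open(local_path, "rb") as f:
            data = f.read()
    else:                                                # tier 3: fall back to NFS
        with open(os.path.join(NFS_DIR, name), "rb") as f:
            data = f.read()
        os.makedirs(LOCAL_CACHE_DIR, exist_ok=True)
        with open(local_path, "wb") as f:                # populate the local tier
            f.write(data)

    if len(data) < 1 << 20:                              # keep only small files in memory
        _memory_cache[name] = data
    return data
```

As described above, the production plan writes files into the cache when the workunit is created rather than on first request; the sketch only illustrates the tier order and the NFS fallback.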
For the ARP1 project, we will be merging all file downloads into a single archive. Previously, even the individual files were frequently getting transient HTTP errors/interruptions and had to wait in the queue; we did fix that bug before we went down. The single archive should resolve the problem of larger downloads and uploads being interrupted multiple times, and we are also exploring other optimizations that should improve workunit distribution.
Hopefully, these changes will result in more efficient file transfers, less I/O pressure on NFS and the storage server, and therefore more throughput under load, which we will need as we look to GPU-enable these and future applications on the grid.
December 6, 2024
We are making final preparations to power down. Before powering off all VMs and the storage servers, we will kill all traffic at the load balancer so we can allow the filesystems and databases to settle before taking a final backup.
When we kill traffic at the load balancer overnight on Dec. 6, there will be no access to the website, forums, BOINC scheduler, BOINC uploads, or BOINC downloads.
We have transferred database backups for BOINC and the website/forums, as well as all source code offsite. We will be transferring the final backups overnight as well.
Today (Dec 6), we have been taking snapshots of all production VMs and sometime this evening we should have all instances that make up our backend infrastructure backed up as snapshots, ready to deploy from resilient storage if there are any issues powering up our instances when we are back online in January.
We will be working on the new project (Mapping Arthritis Markers), long-standing issues with device synchronization between the website and BOINC databases, server status page, real-time stats API, and a new look for the forums as well as new badges during the downtime. We look forward to sharing the results of the data center refresh with you in the New Year - thank you for your support and happy holidays!
December 4, 2024
The WCG shutdown date has been moved up from December 9th, 2024 to December 7th, 2024. We were informed by email on November 21st to shut down by December 9th because there would be no cooling by that date. Earlier this week, hosting asked us to move the date up to the 7th.
We have determined it is not feasible to migrate the BOINC infrastructure to another site during downtime, and we are still waiting to hear back from UHN personnel who manage our DNS records to see when we can switch the website and forums over to the alternate site. If we do hear back, this will take place between now and December 6th, 2024.
As users pointed out in the forums, it did not really make sense to start pushing new ARP1 workunits given the imminent downtime, and we will stop producing new MCM1 workunits tomorrow to hopefully give the bulk of outstanding workunits a chance to be uploaded. When we power down, after all traffic to the upload servers is stopped, deadlines for outstanding workunits will be extended to cover the downtime.
With the improvements to the cloud environment, we have been informed that the issue with the network agents in our cloud environment that causes the website and forums database instance to become inaccessible on the network until hosting intervenes (the cause of this past weekend's outage and many previous outages) will be fixed.
November 26, 2024
IMPORTANT: We have been notified of an extended downtime at the SHARCNET facility for construction, lasting from December 9th, 2024 to January 3rd, 2025. There will be no power and no cooling during this time. We are exploring a temporary migration to another site. We will provide an update on what downtime, if any, can be expected to start on December 9th, 2024. Overall, this upgrade should provide further improvements to WCG capacity.
Bandwidth has been improved thanks to hosting staff at SHARCNET. In addition, we have more and better hardware devoted to handling downloads and uploads, and a more competent load balancer.
ARP1 will resume in limited quantities over the next few days. We will make an effort to focus on extremes, as suggested in the forums, and test the imposed rate limits on workunit production, as well as the total bandwidth across all clients and the number of connections per client for ARP1 file transfers specifically.
The forums were down for an extended period earlier today; they are back up now, and we apologize for the slow response.
November 20, 2024
SHARCNET is tuning performance on the main network node for our cloud environment; users may have noticed some service interruptions today as a result.
We have identified a way to increase bandwidth to the expected level in our cloud environment, which, if today's testing holds up, will offer at least an order-of-magnitude improvement and hopefully more.
However, the approach identified will require updates to DNS records hosted by a separate team at UHN, and more investigation to confirm feasibility. We will update volunteers when we have firm timelines, and on whether there will be any downtime when we switch DNS over; that assumes we do not find a simpler solution requiring no change to DNS records before we verify, with SHARCNET's help, that this is the way to go.
With the additional hosts we have been able to provision and the local disk available to them, we are hopeful that we can write input and output files to local disk more intelligently, based on a modulo of the workunit ID, rather than relying on NFS for everything. The local disks are RAID1 SSDs with ~1TB available for caching downloads in this way, and potentially uploads to researchers, which would never need to touch NFS for more than metadata. This effort will take some time, but we expect to make reasonable progress towards this goal.
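The hostnames and directory layout below are made up; this is just a small sketch of what "based on a modulo of the workunit ID" can look like in practice, so that a given workunit's files always land on, and are served from, the same download server's local disk.

```python
import os

# Invented hostnames and paths; WCG's real download-server layout is not public.
DOWNLOAD_HOSTS = ["dl1.example.org", "dl2.example.org", "dl3.example.org"]
LOCAL_ROOT = "/ssd/wcg/wu_files"           # the ~1TB RAID1 SSD mentioned above

def host_for_workunit(wu_id: int) -> str:
    """Every workunit deterministically maps to one download server."""
    return DOWNLOAD_HOSTS[wu_id % len(DOWNLOAD_HOSTS)]

def local_path_for(wu_id: int, filename: str) -> str:
    """Fan files out over shard directories so no single directory grows huge."""
    shard = wu_id % 1024
    return os.path.join(LOCAL_ROOT, f"{shard:04d}", filename)

# Example with a workunit ID from earlier in this thread: its input files could be
# written straight to that host's local disk at creation time and served from
# there, touching NFS only for metadata.
print(host_for_workunit(660217290), local_path_for(660217290, "input_files.tar"))
```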
November 19, 2024
Working with hosting staff today, we confirmed that the network in our cloud environment is not behaving as expected; the congestion and latency do not make sense at this bandwidth, or given the number of packets in flight at any given time.
We should expect a large increase in available bandwidth when these issues are resolved; we had thought it was only server resources limiting transfer quantities and rates and therefore causing the 503s that appeared as "Transient HTTP error" in the BOINC client.
Hosting staff at SHARCNET were able to recommend some kernel settings they often use for hosts fielding many connections, and also changed some offload settings on their end to help alleviate the problem while we pursue a full resolution.
With the above changes, we have seen higher maximum throughput at the load balancer: we are now typically averaging over 100 Mbps, whereas before the average was below that with ~100 Mbps spikes; spikes are now hitting ~150 Mbps.
November 18, 2024
ARP1 was soft-paused last week; there are only 361 workunits left to claim as of this writing.
Approximately 29,000 ARP1 workunits must now be uploaded before we can slowly titrate ARP1 to the right level, after implementing traffic shaping and rate limits in workunit creation, download, and upload.
Deadlines for ARP1 workunits were extended again on Saturday, Nov 16th, 2024, by 5 days. We will continue to extend the deadlines of ARP1 workunits.
We fixed bugs in the HAProxy and Apache configuration files that were causing connection resets on large file transfers; we hope users have seen that, despite the atrocious speeds, large file transfers are now more likely to succeed. ARP1 should download a single file and upload a single file; we will work on this.
Traffic shaping and rate limits will be applied at the load balancer and backend webservers so that fewer connection slots are available per client IP address specifically for ARP1 file transfers, and so that the overall bandwidth of all ARP1 transfers is always limited to a reasonable fraction of the total available bandwidth.
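The real limits will live in the HAProxy and Apache configuration, and the exact numbers are not published; purely as an illustration, here is a sketch of the two mechanisms described, a per-IP connection-slot cap for ARP1 transfers and a global token bucket holding ARP1 to a fraction of total bandwidth, with made-up numbers.

```python
import time
from collections import defaultdict

# Illustrative numbers only; the real limits used by WCG are not published.
MAX_ARP1_CONNS_PER_IP = 2                  # connection slots per client IP for ARP1 transfers
ARP1_BYTES_PER_SEC = (100e6 / 8) * 0.5     # e.g. half of a ~100 Mbps link, in bytes/second

class Arp1Limiter:
    def __init__(self) -> None:
        self.conns = defaultdict(int)      # currently open ARP1 connections per IP
        self.tokens = ARP1_BYTES_PER_SEC   # token bucket starts full (one second of burst)
        self.last = time.monotonic()

    def try_open(self, ip: str) -> bool:
        """Admit a new ARP1 transfer only if this IP still has a free slot."""
        if self.conns[ip] >= MAX_ARP1_CONNS_PER_IP:
            return False                   # the proxy would answer 503 here
        self.conns[ip] += 1
        return True

    def close(self, ip: str) -> None:
        self.conns[ip] -= 1

    def reserve(self, nbytes: int) -> bool:
        """Global token bucket: may we send nbytes of ARP1 data right now?"""
        now = time.monotonic()
        self.tokens = min(ARP1_BYTES_PER_SEC,
                          self.tokens + (now - self.last) * ARP1_BYTES_PER_SEC)
        self.last = now
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False                       # caller should back off and retry
```

In production the same effect would come from proxy features (per-source connection limits and bandwidth shaping) rather than application code.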
We provisioned 3 new servers, more to come, with the help of SHARCNET. They have much more CPU, memory, and even a large local disk.
We are pursuing tiered caching of files for download: during creation of a workunit, its files would be bulk-transferred or written directly to the local disk of the download servers. We will also explore in-memory caching, especially for smaller files; we believe this will reduce the load on NFS, the SAN, and the storage server, and increase available bandwidth, server availability, and therefore throughput.
We migrated all provisioning scripts, code, and configuration, as well as build and deploy scripts that previously supported only CentOS 7, to also support Ubuntu 22, as that was the only guest OS we could run on this new server group provisioned by SHARCNET.
We migrated the production load balancer to one of these servers, upgrading HAProxy from v1.8 to v2.8. We deployed two new download servers into production in this way, and pending some final fussing with builds that work on CentOS 7 but not yet on Ubuntu 22, we will be able to deploy the binaries required to have them accept file uploads as well. We tuned kernel parameters and application configuration files, especially HAProxy and HTTPD/Apache2, with some napkin math for these new servers and for the existing older CentOS 7 servers as well.
Despite the massive increase in CPU and memory, and a better understanding of what does what in each configuration file resulting in optimizations, we are still only doing ~100 Mbps in total through the production load balancer in our best moments. Therefore, it seems we are ultimately bandwidth-limited at the moment, likely a combination of low available bandwidth and inefficient use of that bandwidth.
Increasing the maxconn and timeout-queue settings on the new download servers to throw fewer 503 errors still resulted in timeouts, and it also made all aspects of the network in our environment unusable due to latency. Users may have noticed in their BOINC clients that, at some points over the last two weeks, high timeout-queue settings and aggressive keep-alive settings would cause retries in the BOINC client to appear stuck, not backing off for minutes in some cases, only to be eventually rejected; only in rare cases did a late start finally push through the queue and connect to the backend. We were experimenting to see if we could serve more requests by having them queue for longer before being hit with a 503 Service Unavailable, but in those cases, as in many other regimes, we found that throughput barely increases while the network congestion/latency renders the entire cloud environment unusable: the website, my terminal, CI/CD jobs, git repos, same or different subnet. We have been aggressively testing in prod, which is bad practice, but we only pushed and evaluated configurations that we believed held promise to improve the situation in some way. We apologize for the chaos and confusion.
[Feb 11, 2025 11:46:08 PM]