World Community Grid - View Thread - API Returning Conflicting Data

World Community Grid Forums

Category: Support

Forum: Website Support

Thread: API Returning Conflicting Data


Thread Status: Active
Total posts in this thread: 32


This topic has been viewed 3363 times and has 31 replies

foxfire
Advanced Cruncher
United States
Joined: Sep 1, 2007
Post Count: 121
Status: Offline

API Returning Conflicting Data

The API is returning conflicting data when multiple requests are used to pull results status. As a workaround for "missing" results I do 3 pulls: the first for In-Progress, the second for Returned and the third for All. After completing the 3 pulls I check to see if there are any "missing" results and, if so, do a final request for modified results.

Beginning yesterday morning the pulls have resulted in conflicting, and I believe incorrect, time stamps in the SentTime and ReportDeadline fields.

The number has varied between none and 29 and appears to be random, in that sometimes all are returned OK and other times there is conflicting data.

When I did the pull tonight I got 1 bad. I did a second pull 3 minutes later and got 7 bad (out of ~20,000 results).

One of them is for FAHV_1000468_3j3q-8O-P2_2463_0. Below is what the Results Status page shows and what was returned.

Results Status shows:
FAHV_ 1000468_ 3j3q-8O-P2_ 2463_ 0-- I73770K-1 Valid 1/7/17 16:10:53 1/7/17 18:31:45 0.06 / 0.06 2.8 / 2.8

The API returned 2 records for this WU, the first in the "Returned" request and the second in the "All" request.

The Returned request returned SentTime=1/7/17 16:10:53, ReportDeadline=1/17/17 16:10:53 and ReceivedTime=1/7/17 18:31:45

The All request returned SentTime=1/7/17 18:57:18, ReportDeadline=1/17/17 18:57:18 and ReceivedTime=1/7/17 18:31:45

All other fields in both records are the same.
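A check for this kind of conflict could be sketched as below. This is only an illustration: the record key "Name" is an assumed field name for the result name, and the records are assumed to have already been parsed into dictionaries. SentTime and ReportDeadline are the fields named above.

```python
from collections import defaultdict


def find_conflicts(records, fields=("SentTime", "ReportDeadline")):
    """Return {result name: set of distinct field tuples} for names whose
    records disagree on any of the given fields.

    "Name" is an assumed key for the result name.
    """
    seen = defaultdict(set)
    for rec in records:
        seen[rec["Name"]].add(tuple(rec[f] for f in fields))
    return {name: values for name, values in seen.items() if len(values) > 1}


# The two records described above, plus one consistent record for contrast.
records = [
    {"Name": "FAHV_1000468_3j3q-8O-P2_2463_0",
     "SentTime": "1/7/17 16:10:53", "ReportDeadline": "1/17/17 16:10:53"},
    {"Name": "FAHV_1000468_3j3q-8O-P2_2463_0",
     "SentTime": "1/7/17 18:57:18", "ReportDeadline": "1/17/17 18:57:18"},
    {"Name": "some_other_result",  # hypothetical name
     "SentTime": "1/7/17 10:00:00", "ReportDeadline": "1/17/17 10:00:00"},
]
conflicts = find_conflicts(records)
```

Only the result that came back with two different SentTime/ReportDeadline pairs ends up in `conflicts`.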

The pull request is: https://secure.worldcommunitygrid.org/api/members/foxfire/results?code=ver_code&Offset=0&ServerState=x&limit=250&format=json

where ServerState is 4 for In-Progress, 5 for Returned and omitted for All, and Offset is incremented until the end of results.
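For readers following along, the paging described here might look roughly like the sketch below. The URL and query parameters are the ones quoted above; `fetch_page` is a hypothetical callable standing in for the actual HTTP request and JSON parsing, and treating a short page as the end of results is an assumption.

```python
from urllib.parse import urlencode

BASE = "https://secure.worldcommunitygrid.org/api/members/{member}/results"


def build_url(member, code, offset, server_state=None, limit=250):
    """Build one request URL; server_state=None omits the filter (the "All" pull)."""
    params = {"code": code, "Offset": offset, "limit": limit, "format": "json"}
    if server_state is not None:
        params["ServerState"] = server_state  # 4 = In-Progress, 5 = Returned
    return BASE.format(member=member) + "?" + urlencode(params)


def pull_all(fetch_page, server_state=None, limit=250):
    """Collect every page for one ServerState.

    fetch_page(offset, server_state, limit) is a hypothetical callable that
    performs the request and returns the list of result records on that page.
    """
    records, offset = [], 0
    while True:
        page = fetch_page(offset, server_state, limit)
        records.extend(page)
        if len(page) < limit:  # short or empty page: assumed end of results
            return records
        offset += limit
```

The three pulls would then be `pull_all(fetch_page, 4)`, `pull_all(fetch_page, 5)` and `pull_all(fetch_page)`.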

Edit: added total results pulled

----------------------------------------

----------------------------------------
[Edit 1 times, last edit by foxfire at Jan 8, 2017 3:46:15 AM]

[Jan 8, 2017 3:43:56 AM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline

Re: API Returning Conflicting Data

foxfire,

With 20,000 results, my guess is that a result is being validated or removed from the database during the execution of your series of queries. This would change the results that are returned and would change the position of the results.

Can you add a check to your queries to see if "ResultsAvailable" changes during a set of queries?

If there is a change in that value during the iteration of your queries, then I suspect that is the root cause and we will have to think on what to do to help you.
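One way the suggested check could be sketched: note the "ResultsAvailable" value reported with each page and flag the offsets where it differs from the previous page. How that value is read out of each response is left out here; the (offset, value) pairs are assumed to have been collected by the caller.

```python
def results_available_changes(pages):
    """pages: iterable of (offset, results_available) pairs in fetch order.

    Returns a list of (offset, previous, current) for every page whose
    ResultsAvailable value differs from the page fetched just before it.
    """
    changes = []
    previous = None
    for offset, available in pages:
        if previous is not None and available != previous:
            changes.append((offset, previous, available))
        previous = available
    return changes


# Hypothetical values: the count drops while the third page is being fetched.
observed = [(0, 20000), (250, 20000), (500, 19998), (750, 19998)]
changes = results_available_changes(observed)
```

An empty list means the count stayed constant for the whole series of queries.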

Once we address this part, we will take a look at the differing data, since that one is odd and I wonder if it is somehow an artifact of the first problem.

thanks,
Kevin

[Jan 9, 2017 10:29:27 PM]

foxfire
Advanced Cruncher
United States
Joined: Sep 1, 2007
Post Count: 121
Status: Offline


Re: API Returning Conflicting Data

Will do and will let you know if I get a change and conflicting data. FYI, the number of ResultsAvailable usually does change during my pulls, but I'll see if this problem shows up when it does change.

Also, I don't know if it will help in tracking this down but until DB maint was applied Feb 28, 2015 the API was rock solid. Following that maint application I had to recode my pulls in an attempt to pick up "missing" results.

Thanks for the help!!

----------------------------------------

[Jan 9, 2017 11:14:41 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: API Returning Conflicting Data

At a return rate of, e.g., 1 per 8 seconds and a pull rate of 250 (the limit) per second, which is the speed of return, there's no hope in heck of ever getting a non-shifted series of fetches without incurring either doubles or misses when it involves 20000 with ServerState 5. In short, such a series very often shows a different ResultsAvailable value from one 250 fetch to the next. That's the reason I've coded for 2 passes to catch the missed and eliminate any doubles, and still the method is not watertight, which is why I'm coding for a different strategy now that the modtime filter function works. Be prepared for lots more hits from my end... the shorter the results status retention, the more frequent. At 24000 a day I'll probably be pulling the latest 2000 every hour, comparing result name and mod time, then keeping the newest if the mod time changed and ditching any that are duplicates.
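The compare-and-keep-newest step described here could be sketched as follows. It is not the poster's actual code: the keys "Name" and "ModTime" are assumed field names, and ModTime is assumed to be a value that compares correctly (for example a numeric timestamp).

```python
def merge_latest(archive, fetched):
    """Merge freshly fetched records into an archive keyed by result name.

    archive: dict mapping result name -> record.
    fetched: iterable of records from the latest pull.
    A fetched record replaces the archived one only if its ModTime is newer;
    exact duplicates and older copies are dropped.
    """
    for rec in fetched:
        old = archive.get(rec["Name"])
        if old is None or rec["ModTime"] > old["ModTime"]:
            archive[rec["Name"]] = rec
    return archive


# Hypothetical records: "a" was modified, "b" is a duplicate, "c" is new.
archive = {
    "a": {"Name": "a", "ModTime": 100, "Status": "In Progress"},
    "b": {"Name": "b", "ModTime": 100, "Status": "Valid"},
}
fetched = [
    {"Name": "a", "ModTime": 150, "Status": "Valid"},
    {"Name": "b", "ModTime": 100, "Status": "Valid"},
    {"Name": "c", "ModTime": 120, "Status": "In Progress"},
]
merge_latest(archive, fetched)
```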

Had the API included the unique db serial number, the number one sees when hovering over a WU link on the results status pages, sorting and filtering could be a lot more efficient.

Edit: The ending number in this link https://www.worldcommunitygrid.org/ms/device/....do?workunitId=1910194704 i.e. the workunitId, which I'm guessing is the seed number for all copies in a distribution.

----------------------------------------
[Edit 2 times, last edit by Former Member at Jan 9, 2017 11:50:52 PM]

[Jan 9, 2017 11:39:08 PM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline


Re: API Returning Conflicting Data

I just deployed an update to the api that does two things:

1) The orderBy will behave basically as it did before, but there will be an additional last order by clause that sorts based on the database generated id of the result. This means that whether the default (sent time) order by is used or you specify another order by field, any tie in that field will be consistently sorted by id.

2) The JSON and XML versions now include an extra field called 'id' which is the database generated id for that result.
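The effect of the extra order by clause can be illustrated with a small client-side analogy. This is not the server's code, just a sketch: sorting on (field, id) gives tied records a fixed order, so the same offset returns the same records no matter how the rows happen to be stored. It does not stop pages shifting when results are added or removed between requests.

```python
def page(records, offset, limit, order_by="SentTime"):
    """Return one page of records ordered by order_by, with ties broken by id."""
    ordered = sorted(records, key=lambda r: (r[order_by], r["id"]))
    return ordered[offset:offset + limit]


# Hypothetical records: three share the same SentTime.
rows = [
    {"id": 7, "SentTime": 1000},
    {"id": 3, "SentTime": 1000},
    {"id": 9, "SentTime": 1000},
    {"id": 5, "SentTime": 2000},
]
# The same rows in a different storage order produce identical pages.
first = page(rows, 0, 2)
again = page(list(reversed(rows)), 0, 2)
```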

Hope this helps

----------------------------------------
[Edit 1 times, last edit by knreed at Jan 10, 2017 3:27:54 AM]

[Jan 10, 2017 3:27:36 AM]

foxfire
Advanced Cruncher
United States
Joined: Sep 1, 2007
Post Count: 121
Status: Offline


Re: API Returning Conflicting Data

If the database id is unique and not reused it will solve my problems.
I'll post if the conflicting data happens again.
Thank you again for your help.

----------------------------------------

[Jan 10, 2017 4:13:57 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: API Returning Conflicting Data

Very promising, given that the ID maintains a chronology (the higher the number, the newer the result at generation time), so the lookup and match functions don't have to pass through top to bottom to see if a result is already on archive and with what status. Kind of a narrow-down method, the way we open a dictionary and then home in on the word looked for.
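The dictionary-style narrowing down corresponds to a binary search over ids kept in sorted order. A minimal sketch, assuming the archive keeps its result ids in an ascending list (the ids below are made up):

```python
from bisect import bisect_left


def find_by_id(sorted_ids, result_id):
    """Return the position of result_id in the ascending list, or -1 if absent.

    Each step halves the remaining range instead of scanning top to bottom.
    """
    i = bisect_left(sorted_ids, result_id)
    if i < len(sorted_ids) and sorted_ids[i] == result_id:
        return i
    return -1


archive_ids = [1001, 1004, 1010, 1023, 1050]  # hypothetical ids, ascending
position = find_by_id(archive_ids, 1023)
missing = find_by_id(archive_ids, 1011)
```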

[Jan 10, 2017 8:11:58 AM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: API Returning Conflicting Data

Edit: The ending number in this link https://www.worldcommunitygrid.org/ms/device/viewWorkunitStatus.do?workunitId=1910194704 i.e. the workunitId, which I'm guessing is the seed number for all copies in a distribution.

The finding is: not quite. Unique the workunitId may be, but for an MCM a different range is used than for FAHV, which seems to be in the 1.18-1.19 billion numbering series, such as 1189504048. More testing is needed here, but the performance improvement is still expected to be significant.

[Jan 10, 2017 11:42:27 AM]

knreed
Former World Community Grid Tech
Joined: Nov 8, 2004
Post Count: 4504
Status: Offline


Re: API Returning Conflicting Data

Sekerob,

The id returned in the API is the id of the result, which is one of the copies of the workunit. The workunit has a different id. If the workunit id would also be of use, we can include it as well.

In both cases, the ids are generated sequentially as new records are created on the workunit or result tables.

[Jan 10, 2017 2:00:21 PM]

Former Member
Cruncher
Joined: May 22, 2018
Post Count: 0
Status: Offline


Re: API Returning Conflicting Data

Always 'assumed' [the standard pitfall] that the number we see at the end of the WU link was one counted since the start of WCG, i.e. whilst we're racing towards 3.2 billion validated, the 1910194704 was the 1.9 billionth project task [with its suffixes then indicating the nth copy of that WU]. While it could be convenient to be able to link the API id number to the one seen on the RS pages, it's at least to me of no real use; any permanently unique and incrementing number is as good as any for my archive (I was unaware there was a second series).

In tests, the 'narrowing down' method can increase the speed of finding a match by a factor of 1000, yes 1 thousand [the more records, the bigger the gain], and matching is currently by far the biggest bog-down when recomputing in full a workbook that eats 1.4 GB of RAM. Add to that the ModTime method, each hour fetching just and -only- anything changed/added in the past 3600 seconds + a little extra to cover the serial fetch time [1x250 per second, 2000 needing 8 seconds], and it is bound to improve the execution times massively (the largest live archive tested was 410,000 results). Now it's a case of applying the necessary changes to the code and robustifying. The end goal: not missing a single 'credited' result.
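The fetch arithmetic in this paragraph can be written out as a small sketch. The 250 limit, the one request per second and the 3600 second interval come from the post; the `margin` value is a made-up placeholder for the "little extra".

```python
import math


def fetch_plan(expected_results, limit=250, seconds_per_request=1.0,
               interval=3600, margin=60):
    """Return (requests, fetch_seconds, lookback_seconds) for one hourly pull.

    lookback_seconds is how far back the mod-time filter would need to reach:
    the interval, plus the time the serial fetches take, plus a safety margin
    (margin is a hypothetical value, not one given in the thread).
    """
    requests = math.ceil(expected_results / limit)
    fetch_seconds = requests * seconds_per_request
    lookback_seconds = interval + fetch_seconds + margin
    return requests, fetch_seconds, lookback_seconds


plan = fetch_plan(2000)
```

For 2000 results this works out to 8 requests taking about 8 seconds, matching the figures quoted above.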

thx for your swift action :)

P.S. "The id of the result which is one of the copies of the workunit" is probably the better choice: when you happen to get 2 or more copies of the same WU spread over multiple machines for the same member [the bigger the farm, the greater the chance], the WU number would cause a duplicate index number, i.e. that key could not be set to unique, although it would make identifying such a situation easier. A rarity, and only for the few.

[Jan 10, 2017 2:48:37 PM]
