Closed Bug 772467 Opened 12 years ago Closed 12 years ago

Figure out (and fix!) stale buildslave connections in AWS

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

Details

(Whiteboard: [ec2])

Attachments

(1 file)

If you reboot a build slave in AWS without first shutting down buildbot, the master never learns that the old instance disconnected, and it keeps refusing connections from the new instance indefinitely.
I'm not at all sure why the AWS slaves hit this problem more frequently than our other machines, but this patch seems to work around the issue.

Instead of relying on callRemote() to make the old TCP session die, we add a timeout (30s here). If the old slave connection hasn't responded before the timeout expires, we disconnect it, which allows the next slave connection to succeed.
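For readers without the attachment handy, here is a rough sketch of the idea, not the patch itself. It uses stdlib asyncio rather than Twisted so it runs on its own; `FakeConnection`, `ping`, `disconnect` and `evict_if_stale` are hypothetical names chosen for illustration and are not buildbot APIs.

```python
import asyncio


class FakeConnection:
    """Hypothetical stand-in for an existing slave connection (not a buildbot class)."""

    def __init__(self, responds):
        self.responds = responds
        self.disconnected = False

    async def ping(self):
        # A live peer answers right away; a stale TCP session never answers.
        if not self.responds:
            await asyncio.Event().wait()  # blocks until cancelled
        return True

    def disconnect(self):
        self.disconnected = True


async def evict_if_stale(old_conn, timeout=30.0):
    """Ping the old connection and drop it if no reply arrives in time.

    Returns True if the old connection was dropped.
    """
    try:
        await asyncio.wait_for(old_conn.ping(), timeout)
    except asyncio.TimeoutError:
        # No answer within the timeout: assume the old session is dead.
        old_conn.disconnect()
        return True
    return False
```

With a short timeout, `asyncio.run(evict_if_stale(FakeConnection(responds=False), timeout=0.05))` drops the unresponsive connection, while a responsive one is left alone. What the master should do when the old connection does answer is outside this sketch.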
Attachment #640910 - Flags: review?(dustin)
Comment on attachment 640910 [details] [diff] [review]
give up on old slave connections

lgtm. getPeer includes both the remote IP and port, so a collision is unlikely (since slaves don't re-use ports).
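To illustrate the point about peer addresses, a small stdlib sketch (plain sockets on loopback, not Twisted's getPeer; `peer_addresses` is a hypothetical helper) shows that two connections from the same host arrive with different source ports, so the (IP, port) pair tells them apart:

```python
import socket


def peer_addresses(n=2):
    """Open n loopback connections to one listener and return each peer (ip, port)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(n)
    clients, accepted, peers = [], [], []
    try:
        for _ in range(n):
            clients.append(socket.create_connection(server.getsockname()))
            conn, addr = server.accept()
            accepted.append(conn)
            peers.append(addr)  # addr is the client's (ip, ephemeral port)
    finally:
        for s in clients + accepted + [server]:
            s.close()
    return peers
```

Both entries share the IP 127.0.0.1 but differ in port, because the connections are open at the same time. This only demonstrates concurrent connections; port reuse across reboots is the reviewer's assumption and is not tested here.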
Attachment #640910 - Flags: review?(dustin) → review+
Comment on attachment 640910 [details] [diff] [review]
give up on old slave connections

landed on bm35 only for now
Attachment #640910 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Not sure if we want or need this on other buildbot masters as well?
Product: mozilla.org → Release Engineering
Component: General Automation → General