Closed Bug 772467 Opened 13 years ago Closed 13 years ago

Figure out (and fix!) stale buildslave connections in AWS

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

Details

(Whiteboard: [ec2])

Attachments

(1 file)

give up on old slave connections (13 years ago, Chris AtLee [:catlee], 2.13 KB, patch; flags: dustin: review+, catlee: checked-in+)

Chris AtLee [:catlee]

Assignee

Description

13 years ago

If you reboot a build slave in AWS without first shutting down buildbot, the master never learns that the old instance disconnected, and it will refuse connections from the new instance indefinitely.

Chris AtLee [:catlee]

Assignee

Comment 1

13 years ago

Attached patch: give up on old slave connections

I'm not at all sure why the AWS slaves hit this problem more often than our other machines, but this patch seems to work around it. Instead of relying on callRemote() to make the old TCP session die, we add a timeout (30 s here); if we haven't heard back from the old slave before it expires, we disconnect it. That lets the next slave connection succeed.
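The timeout workaround can be sketched in Python. This is not the patch itself (attachment 640910 targets buildbot's Twisted code): it uses the standard library's asyncio in place of Twisted, and `OldConnection`, `ping()` and `should_accept_new()` are hypothetical names. It assumes that when a slave reconnects under a name that is already attached, the master pings the old connection, refuses the newcomer if the old one answers, and otherwise drops the old one once the timeout expires.

```python
import asyncio


class OldConnection:
    """Hypothetical stand-in for the master's handle on an already-attached slave."""

    def __init__(self, alive: bool) -> None:
        self.alive = alive
        self.disconnected = False

    async def ping(self) -> str:
        if not self.alive:
            # A rebooted AWS instance never answers, and no TCP error arrives
            # either, so without a timeout this await would never finish.
            await asyncio.Event().wait()
        return "pong"

    def disconnect(self) -> None:
        self.disconnected = True


async def should_accept_new(old: OldConnection, timeout: float = 30.0) -> bool:
    """Decide whether a new connection using the same slave name may attach."""
    try:
        await asyncio.wait_for(old.ping(), timeout)
    except asyncio.TimeoutError:
        # No reply within the timeout: give up on the old session so the
        # new connection can take its place.
        old.disconnect()
        return True
    # The old slave answered, so it is still attached; refuse the duplicate.
    return False
```

With a short timeout such as `0.05`, a dead `OldConnection(alive=False)` is disconnected and the new connection is accepted, while a live one causes the duplicate to be refused. In Twisted, the analogous approach would be to schedule a `reactor.callLater()` that drops the old connection and cancel it if the `callRemote()` reply arrives first; newer Twisted releases also provide `Deferred.addTimeout()`.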

Attachment #640910 - Flags: review?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Comment 2

13 years ago

Comment on attachment 640910: give up on old slave connections

lgtm. getPeer includes both the remote IP and port, so a collision is unlikely (since slaves don't re-use ports).
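The review point about getPeer can be illustrated with a small Python sketch. In Twisted, `transport.getPeer()` on a TCP connection returns an address object carrying both `host` and `port`; here a hypothetical `Peer` named tuple stands in for it, and `connection_key()` is a hypothetical helper, not code from the patch. Because a rebooted instance normally reconnects from a fresh ephemeral source port, the old and new connections get different keys even when the IP address is unchanged.

```python
from typing import NamedTuple


class Peer(NamedTuple):
    """Hypothetical stand-in for the (host, port) address of a TCP peer."""
    host: str
    port: int


def connection_key(slave_name: str, peer: Peer) -> str:
    # Include the port: a rebooted instance that comes back with the same IP
    # will normally connect from a different ephemeral source port, so its
    # key will not collide with the stale connection's key.
    return f"{slave_name}@{peer.host}:{peer.port}"
```

For example, `connection_key("slave-01", Peer("10.0.0.5", 49152))` and `connection_key("slave-01", Peer("10.0.0.5", 50311))` produce distinct keys, which is why matching on the full peer address is unlikely to confuse the stale session with its replacement.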

Attachment #640910 - Flags: review?(dustin) → review+

Chris AtLee [:catlee]

Assignee

Comment 3

13 years ago

Comment on attachment 640910: give up on old slave connections

landed on bm35 only for now

Attachment #640910 - Flags: checked-in+

Chris AtLee [:catlee]

Assignee

Updated

13 years ago

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Chris AtLee [:catlee]

Assignee

Comment 4

13 years ago

Not sure if we want or need this on other buildbot masters as well?

Nobody; OK to take it and work on it

Updated

12 years ago

Product: mozilla.org → Release Engineering

Nobody; OK to take it and work on it

Updated

7 years ago

Component: General Automation → General

Categories

(Release Engineering :: General, defect, P2)

Security

(public)