Skip to content

Allow VM that has never started to have volumes attached#3276

Merged
yadvr merged 9 commits into
apache:4.11from
shapeblue:attach-disk-state
May 10, 2019
Merged

Allow VM that has never started to have volumes attached#3276
yadvr merged 9 commits into
apache:4.11from
shapeblue:attach-disk-state

Conversation

@anuragaw
Copy link
Copy Markdown
Contributor

@anuragaw anuragaw commented Apr 15, 2019

With this patch b766bf7
we started tracking disks in attaching state so that other attach request can fail gracefully. However this missed the case where disks were in allocated state but attach was requested.

For the use case where users want to attach disk in allocated state but not ready, we need to have allocated-attaching transition as well. We must take care of returning to the original state - allocated or ready - when attach request has completed.

For the use case of unstarted vm's the disk must proceed as follows - "Allocated" -> Attaching -> Allocated. When VM is started, the disk is "created" and pool is assigned. For the use case of started VMs it's more trivial and disk proceeds as follows - Ready -> Attaching -> Ready.

Test this by creating a VM with "startvm=false", create a disk and try attaching it in allocated state. It would give an exception on latest 4.11 but will be fixed on this patch.

Description

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):

How Has This Been Tested?

Manually (using CloudMonkey, dashboard) and through Marvin Test

Allocated_Attached_Disk

for the use case where users want to attach disk in allocated state
but not ready, we need to have more states for keeping track of
attaching state.

Test this by creating a VM with "startvm=false", create a disk and
try attaching it in allocated state. It would give an exception on
latest 4.11 but will be fixed on this patch.
@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2701

@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@anuragaw a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-3496)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 33075 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3276-t3496-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 67 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_02_internallb_roundrobin_1RVPC_3VM_HTTP_port80 Failure 416.00 test_internal_lb.py

@yadvr yadvr changed the title Add more states for attaching disk in allocated state [DO NOT MERGE WIP] Add more states for attaching disk in allocated state Apr 16, 2019
@yadvr yadvr added this to the 4.11.3.0 milestone Apr 16, 2019
@yadvr yadvr requested review from mike-tutkowski and rafaelweingartner and removed request for rafaelweingartner April 16, 2019 11:06
@yadvr
Copy link
Copy Markdown
Member

yadvr commented Apr 16, 2019

@anuragaw can you implement a marvin test for the test-case.

@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2704

@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan test

@anuragaw
Copy link
Copy Markdown
Contributor Author

@anuragaw can you implement a marvin test for the test-case.

Done @rhtyd , let me know if this covers the case . Thanks!

Comment thread api/src/com/cloud/storage/Volume.java Outdated
UploadError("Volume upload encountered some error"),
UploadAbandoned("Volume upload is abandoned since the upload was never initiated within a specificed time"),
Attaching("The volume is attaching to a VM");
AttachingFromReady("The volume is attaching to a VM from Ready State"),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anuragaw don't change the original state Attaching to honour db volume states that may already be in Attaching state in db. This may cause usage issues post an upgrade.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, my bad. Fixing.

handleTargetsForVMware(hostId, volumePool.getHostAddress(), volumePool.getPort(), volume.get_iScsiName());
// volumePool() should be null if the VM we are detaching the disk from has never been started before
// only revoke access on volumes that are actually on a datastore
if (volumePool != null) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the original logic be retained? What happens if volumePool is null?

Copy link
Copy Markdown
Contributor Author

@anuragaw anuragaw Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the volumePool was null initially the dataStore was null in previous implementation and revokeAccess did nothing. That happens after this check as well.
Also I don't think it should be null as the handleTargetsForVMWare(hostId,volumePool.getHostAddress()... would crash.

Comment thread api/src/com/cloud/storage/Volume.java Outdated
UploadAbandoned("Volume upload is abandoned since the upload was never initiated within a specificed time"),
Attaching("The volume is attaching to a VM");
AttachingFromReady("The volume is attaching to a VM from Ready State"),
AttachingFromAllocated("The volume is attaching to a VM from Allocated State");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can still fix the issue without introducing a new state. All we're saying is to allow volumes's attaching state whether their previous state is Ready or Allocated.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we need a new state is that if an attach fails we want the disk to return to the original state it was in - Ready -> Attaching -> FAIL -> Ready, or Allocated -> Attaching -> FAIL -> Allocated.

Perhaps I am misunderstanding something?

Copy link
Copy Markdown
Contributor Author

@anuragaw anuragaw Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking back at the original patch that caused regression, it seems that simple Allocated->Attaching would suffice. I'll change.

Comment thread api/src/com/cloud/storage/Volume.java Outdated
s_fsm.addTransition(new StateMachine2.Transition<State, Event>(Ready, Event.AttachRequested, Attaching, null));
s_fsm.addTransition(new StateMachine2.Transition<State, Event>(Attaching, Event.OperationSucceeded, Ready, null));
s_fsm.addTransition(new StateMachine2.Transition<State, Event>(Attaching, Event.OperationFailed, Ready, null));
s_fsm.addTransition(new StateMachine2.Transition<State, Event>(Allocated, Event.AttachRequested, AttachingFromAllocated, null));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this only and rest can follow the same, i.e. allow both Ready and Allocated state volumes to go into Attaching state on the event.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

self.assertEqual(root_volume.podname, list_pods.name)

@attr(tags = ["advanced", "advancedns", "smoke", "basic"], required_hardware="true")
def test_11_attach_volume_with_unstarted_vm(self):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test case LGTM. Also check Travis if this passes OK.

@DaanHoogland DaanHoogland changed the title [DO NOT MERGE WIP] Add more states for attaching disk in allocated state [WIP DO NOT MERGE] Add more states for attaching disk in allocated state Apr 18, 2019
@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2711

@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@anuragaw a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@anuragaw
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@anuragaw a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2712

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-3504)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 29605 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3276-t3504-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 68 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-3506)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 29936 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3276-t3506-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Smoke tests completed. 68 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@yadvr yadvr changed the title [WIP DO NOT MERGE] Add proper state transitions for attaching disk in allocated state [WIP DO NOT MERGE] Allow VM that has never started to have volumes attached Apr 23, 2019
Signed-off-by: Rohit Yadav <rohit.yadav@shapeblue.com>
@yadvr
Copy link
Copy Markdown
Member

yadvr commented Apr 23, 2019

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2715

@yadvr
Copy link
Copy Markdown
Member

yadvr commented Apr 23, 2019

@blueorangutan test matrix

@blueorangutan
Copy link
Copy Markdown

@rhtyd a Trillian-Jenkins matrix job (centos6 mgmt + xs71, centos7 mgmt + vmware65, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-3525)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 26397 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3276-t3525-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 68 look OK, 0 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File

@blueorangutan
Copy link
Copy Markdown

Trillian test result (tid-3524)
Environment: xenserver-71 (x2), Advanced Networking with Mgmt server 7
Total time taken: 27352 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3276-t3524-xenserver-71.zip
Intermittent failure detected: /marvin/tests/smoke/test_scale_vm.py
Smoke tests completed. 67 look OK, 1 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_scale_vm Error 17.52 test_scale_vm.py

Copy link
Copy Markdown
Contributor

@borisstoyanov borisstoyanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@yadvr yadvr changed the title [WIP DO NOT MERGE] Allow VM that has never started to have volumes attached Allow VM that has never started to have volumes attached Apr 24, 2019
@yadvr yadvr merged commit f9b61bc into apache:4.11 May 10, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants