
Learnings from scaling Ironic at Yahoo, by Arun S A G



  1. Learnings from scaling Ironic at Yahoo
     Arun S A G (saga@yahoo-inc.com, zer0c00l on freenode), Yahoo Inc
     May 08, 2017
     https://github.com/sagarun/presentations/

  2. Background and Architecture

  3. Cluster Architecture
     [Diagram; its layout is not recoverable from the extracted text. Labels that survive:]
     ◮ Host groups: db[1-2].ostk (Database Node), api[1-3].ostk (API Node), ic[1-2].ostk (Ironic Conductor), mq[1-2].ostk (Message Queue), ats[1-2].ostk, its[1-2].ostk (10g nic), Console / Tools / Bootbox Server (Existing), Target Nodes
     ◮ Services: nova-api, nova-scheduler, nova-compute, neutron-server, neutron-agent, keystone, glance-api, glance-registry, ironic-api, Horizon (dual connected), zookeeper ensemble, ATS proxy to yapache http server, NFS target
     ◮ Protocols and ports: mysql rw/ro, rpc (5671, 5672), https, dhcp (67), tftp (69), nfs rw/ro, ipxe binary via tftp, images + ipxe via https, OOB ipmi power management
     ◮ Networks: Openstack Mgmt Vlan, Out of band network, Corp vlan

  7. Migrating to Ironic
     ◮ Import nodes from old system into Ironic
     ◮ Create neutron port for the node
     ◮ If the node is already active in the old system, 'fake' boot it with the fake_pxe driver
     ◮ Once everything is successful, switch to the pxe_ipmitool driver
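
The migration steps above can be sketched as an ordered plan per node. This is a minimal illustration, not the tooling used at Yahoo: only the driver names `fake_pxe` and `pxe_ipmitool` come from the slide. The `LegacyNode` record, the `migration_plan` helper, and the choice to skip the fake boot for inactive nodes are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Driver names are taken from the slide; everything else is hypothetical.
FAKE_DRIVER = "fake_pxe"
REAL_DRIVER = "pxe_ipmitool"


@dataclass
class LegacyNode:
    """Hypothetical stand-in for a record exported from the old system."""
    name: str
    mac: str
    active: bool  # True if the node is already deployed in the old system


def migration_plan(node: LegacyNode) -> List[str]:
    """Return the ordered migration steps for one node as plain strings."""
    steps = [
        f"import {node.name} into Ironic",
        f"create neutron port for {node.mac}",
    ]
    if node.active:
        # Assumption: only nodes that are already active need the fake boot.
        steps.append(f"'fake' boot {node.name} with driver {FAKE_DRIVER}")
    steps.append(f"switch {node.name} to driver {REAL_DRIVER}")
    return steps


if __name__ == "__main__":
    node = LegacyNode("node1.example", "00:1c:1a:1d:10:54", active=True)
    for step in migration_plan(node):
        print(step)
```

Keeping the plan as data makes it easy to review or dry-run before anything touches a live node.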

  8. Ironic

  10. Ironic Setup
      ◮ Ironic API runs behind an Apache server
      ◮ Ironic conductors (2)


  15. What could possibly go wrong?
      ◮ Ironic boots started to fail
      ◮ ironic-conductor was using a lot of CPU
      ◮ Ironic API calls took too long

  18. Solutions
      ◮ Sync_Power_State periodic task
      ◮ Increase the number of Ironic conductors
      ◮ Run multiple conductors on the same host
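
The slide does not say how the periodic task was tuned. As a rough way to see why it matters, the sketch below estimates how many power-status polls per second each conductor has to issue when every node is polled once per interval. The node count, the interval, and the even split across conductors are assumptions for illustration, not figures from the talk.

```python
def polls_per_second(nodes: int, conductors: int, interval_s: float) -> float:
    """Average power-status polls per second for one conductor.

    Assumes each node is polled once per interval and that nodes are
    spread evenly across conductors (both are simplifications).
    """
    if conductors < 1 or interval_s <= 0:
        raise ValueError("need at least one conductor and a positive interval")
    return nodes / conductors / interval_s


if __name__ == "__main__":
    # Hypothetical numbers, not taken from the talk.
    for conductors in (2, 4, 8):
        rate = polls_per_second(nodes=12_000, conductors=conductors, interval_s=60)
        print(f"{conductors} conductors: {rate:.1f} polls/s each")
```

Under these assumptions, a longer interval or more conductors both lower the per-conductor rate, which is consistent with the directions listed on the slide.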

  19. Neutron

  24. Neutron setup
      ◮ All 3 API servers run neutron-server
      ◮ 24 API/RPC workers
      ◮ 4 neutron dhcp agents
      ◮ All networks/subnets are managed by all 4 agents (HA)
      ◮ ISC DHCPD driver instead of dnsmasq

  25. What is sync state?

  27. A tale of two drivers
      ◮ OMShell driver
      ◮ Pypureomapi driver

  28. OMShell

      -bash-4.1$ omshell
      > server 127.0.0.1
      > port 7911
      > key keyname secret
      > connect
      obj: <null>
      > new host
      obj: host
      > set hardware-address = 00:1c:1a:1d:10:54
      obj: host
      hardware-address = 00:1c:1a:1d:10:54
      > open
      obj: host
      hardware-address = 00:1c:1a:1d:10:54
      ip-address = 0a:d7:a6:b1
      name = "hostname.yahoo.com-0"
      hardware-type = 00:00:00:01
      > remove
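
In the session above, omshell prints `ip-address` as colon-separated hex bytes rather than a dotted quad. A small helper like the following can make such output readable. It is an illustrative sketch: the function name `decode_omapi_ip` is hypothetical and is not part of omshell or of any driver mentioned in the talk.

```python
import ipaddress


def decode_omapi_ip(hex_bytes: str) -> str:
    """Convert colon-separated hex bytes (as printed by omshell) to dotted-quad."""
    raw = bytes(int(part, 16) for part in hex_bytes.split(":"))
    # IPv4Address accepts a packed 4-byte value and raises on any other length.
    return str(ipaddress.IPv4Address(raw))


if __name__ == "__main__":
    print(decode_omapi_ip("0a:d7:a6:b1"))  # 10.215.166.177
```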

  29. Sync State with OMShell

  30. Sync State with PypureOMAPI

  33. Where do we go from here?
      ◮ ISC DHCPD restarts are not ideal
      ◮ VIP thinks dhcpd is down whenever it restarts
      ◮ Move to Kea DHCP Server

  34. Density Test

  35. When did things start to break?
      ◮ At 24500 nodes, API servers started swapping

  36. Swap and memory usage on API nodes

  40. Memory usage
      ◮ Neutron is the biggest user of memory: 1.4 GB per process
      ◮ Subnets: 2500, Ports: 43000
      ◮ Easy fix: reduce the number of api_workers and rpc_workers
      ◮ Long-term fix: investigate memory usage, isolate neutron
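
The arithmetic behind the easy fix can be sketched as follows. The 1.4 GB per process figure and the count of 24 workers come from the slides. The slides do not say whether the 24 workers are per host or split between API and RPC, so treating them as 24 processes on one host is an assumption, and the 16 GB budget is a hypothetical number.

```python
GB_PER_WORKER = 1.4  # from the slide: about 1.4 GB per neutron process


def neutron_memory_gb(workers: int, gb_per_worker: float = GB_PER_WORKER) -> float:
    """Estimated resident memory if every worker reaches the per-process figure."""
    return workers * gb_per_worker


def max_workers(budget_gb: float, gb_per_worker: float = GB_PER_WORKER) -> int:
    """Largest worker count whose estimated memory stays within the budget."""
    return int(budget_gb // gb_per_worker)


if __name__ == "__main__":
    # 24 workers is from the slide; one host is an assumption.
    print(f"24 workers: about {neutron_memory_gb(24):.1f} GB")
    # Hypothetical budget for neutron on one API node.
    print(f"workers that fit in 16 GB: {max_workers(16)}")
```

This is a worst-case linear estimate. Actual usage depends on how much memory the worker processes share, which is what the long-term investigation would have to measure.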

  41. Learnings

  47. Learnings
      ◮ Do *density* and *scale* testing before taking on production
      ◮ Avoid spawning processes; use native Python libraries whenever possible
      ◮ Pay attention to periodic tasks
      ◮ Be prepared to scale horizontally
      ◮ Pay attention to the number of workers, conductors, and rpc_workers
      ◮ Don't forget to have fun :)
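
The point about avoiding spawned processes can be illustrated with a small timing comparison. This is a generic sketch, not a measurement from the talk: the child process just starts a Python interpreter and exits, and the in-process work is a trivial stand-in for a native library call.

```python
import subprocess
import sys
import time


def time_subprocess(n: int) -> float:
    """Seconds spent spawning n short-lived child processes."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def time_in_process(n: int) -> float:
    """Seconds spent doing n trivial in-process calls (stand-in for a library call)."""
    start = time.perf_counter()
    for _ in range(n):
        sum(range(100))
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 20
    print(f"subprocess: {time_subprocess(n):.3f} s for {n} calls")
    print(f"in-process: {time_in_process(n):.6f} s for {n} calls")
```

The absolute numbers vary by machine. The gap comes from the fixed cost of starting a process on every call, and that cost is repeated for every node the caller has to handle.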

  48. Questions

  49. References
      ◮ Layout and background: https://github.com/mtreinish/openstack-health-presentation
      ◮ Picture from TV show: http://www.imdb.com/title/tt4338930/
      ◮ Picture of explosion: https://en.wikipedia.org/wiki/Explosion
