14:30:16 <ttereshc> #startmeeting Pulp Triage 2017-08-18
14:30:16 <ttereshc> !start
14:30:16 <ttereshc> #info ttereshc has joined triage
14:30:16 <pulpbot> Meeting started Fri Aug 18 14:30:16 2017 UTC and is due to finish in 60 minutes. The chair is ttereshc. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:30:16 <pulpbot> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:30:16 <pulpbot> The meeting name has been set to 'pulp_triage_2017_08_18'
14:30:16 <pulpbot> ttereshc: ttereshc has joined triage
14:30:21 <daviddavis> !here
14:30:21 <daviddavis> #info daviddavis has joined triage
14:30:21 <pulpbot> daviddavis: daviddavis has joined triage
14:30:24 <mhrivnak> elijah_d ok we can chat after triage.
14:30:26 <mhrivnak> #info mhrivnak has joined triage
14:30:26 <mhrivnak> !here
14:30:26 <pulpbot> mhrivnak: mhrivnak has joined triage
14:30:36 <bizhang> !here
14:30:36 <bizhang> #info bizhang has joined triage
14:30:37 <pulpbot> bizhang: bizhang has joined triage
14:30:38 <ttereshc> !next
14:30:39 <bmbouter> !here
14:30:40 <ttereshc> #topic Unable to sync docker repo because worker dies - http://pulp.plan.io/issues/2966
14:30:40 <bmbouter> #info bmbouter has joined triage
14:30:40 <pulpbot> ttereshc: 3 issues left to triage: 2966, 2979, 2985
14:30:41 <pulpbot> Issue #2966 [NEW] (unassigned) - Priority: Normal | Severity: High
14:30:42 <pulpbot> Unable to sync docker repo because worker dies - http://pulp.plan.io/issues/2966
14:30:43 <pulpbot> bmbouter: bmbouter has joined triage
14:30:44 <elijah_d> mhrivnak, ok
14:30:49 <dkliban> !here
14:30:49 <dkliban> #info dkliban has joined triage
14:30:49 <pulpbot> dkliban: dkliban has joined triage
14:31:00 <dalley> !here
14:31:00 <dalley> #info dalley has joined triage
14:31:00 <pulpbot> dalley: dalley has joined triage
14:31:08 <ttereshc> mhrivnak, were you talking about this issue with elijah_d ?
14:31:30 <ttereshc> yeah, I see, so skip it for now?
14:31:38 <mhrivnak> Yes. I just want to try digging in a bit more to see if we can figure out where sigkill is coming from.
14:31:48 <ttereshc> !propose skip
14:31:48 <ttereshc> #idea Proposed for #2966: Skip this issue for this triage session.
14:31:48 <pulpbot> ttereshc: Proposed for #2966: Skip this issue for this triage session.
14:31:55 <bmbouter> try using strace on the pid receiving the sigkill
14:31:58 <mhrivnak> Yeah, I think that's fine. We'll look at it today.
14:32:11 <ttereshc> !accept
14:32:11 <ttereshc> #agreed Skip this issue for this triage session.
14:32:11 <pulpbot> ttereshc: Current proposal accepted: Skip this issue for this triage session.
14:32:12 <pulpbot> ttereshc: 2 issues left to triage: 2979, 2985
14:32:12 <mhrivnak> I don't think strace will work, because sigkill doesn't go to the process.
14:32:13 <ttereshc> #topic Celery workers may deadlock when PULP_MAX_TASKS_PER_CHILD and mongo replica set are used - http://pulp.plan.io/issues/2979
14:32:13 <pulpbot> Issue #2979 [NEW] (unassigned) - Priority: Normal | Severity: High
14:32:14 <pulpbot> Celery workers may deadlock when PULP_MAX_TASKS_PER_CHILD and mongo replica set are used - http://pulp.plan.io/issues/2979
14:32:19 <ipanova> !here
14:32:19 <ipanova> #info ipanova has joined triage
14:32:19 <pulpbot> ipanova: ipanova has joined triage
14:32:21 <mhrivnak> But there are some other kernel audit tricks that might help. :)
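[Editor's note: strace attached to the victim pid will report the process dying (e.g. "+++ killed by SIGKILL +++") but cannot show who sent the signal, and no in-process handler can ever run; the usual way to find the sender is a kernel audit rule on the kill syscall, something like auditctl -a always,exit -F arch=b64 -S kill -F a1=9, which is the approach ipanova describes below. A minimal Python sketch of mhrivnak's point that SIGKILL "doesn't go to the process", assuming Linux and Python 3:

    import os
    import signal

    # Catchable signals such as SIGTERM can be handled in-process...
    signal.signal(signal.SIGTERM, lambda signum, frame: print("caught SIGTERM"))

    # ...but the kernel refuses to install a handler for SIGKILL at all,
    # which is why the dying worker can never log who killed it.
    try:
        signal.signal(signal.SIGKILL, lambda signum, frame: None)
    except (OSError, ValueError) as exc:
        print("cannot handle SIGKILL:", exc)  # Errno 22: Invalid argument

    os.kill(os.getpid(), signal.SIGTERM)  # the SIGTERM handler fires
]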
14:32:30 <bmbouter> oh yeah sigkill goes to init
14:32:38 <bmbouter> so I commented on this one
14:32:58 <bmbouter> I did not comment that both katello and sat will be affected by this b/c they both use that MAX_TASKS... option
14:33:39 <ipanova> mhrivnak: i was trying to track down the sigkill with audit and so far no luck
14:33:40 <bmbouter> fixing this will be a significant change to how our celery stuff runs, I think the only way is to have celery stop forking
14:34:00 <mhrivnak> ipanova ok, let's compare notes in a bit.
14:34:20 <bmbouter> and we need one more piece of info, which is: is the postgresql client driver single threaded or not?
14:34:30 <bmbouter> I'm not sure, but if it is then not only will pulp2 have this problem but also pulp3
14:34:49 <daviddavis> 349766
14:34:53 <daviddavis> oops
14:34:55 <mhrivnak> Is there a recommendation we can make for being able to identify when one of them is deadlocked?
14:35:09 <bmbouter> there is not unfortunately
14:35:19 <bmbouter> perhaps they could count the threads
14:35:49 <bmbouter> but that could also be unreliable since the thread count changes during operation
14:36:10 <bmbouter> really you need a core dump to look at specifically how many pymongo threads are in existence
14:36:29 <mhrivnak> But if it makes it far enough past the fork to be spawning extra threads, has it effectively dodged the bullet?
14:37:09 <mhrivnak> Anyway, maybe that's getting into the weeds.
14:37:13 <ichimonji10> elijah_d++
14:37:14 <pulpbot> ichimonji10: elijah_d's karma is now 10
14:37:22 <bmbouter> it has, but the problem is that before it makes it that far, you can't know if it will make it that far or if it's deadlocked already
14:37:23 <elijah_d> ichimonji10, thanks!
14:37:29 <mhrivnak> If there's a way to help people identify this in the meantime, that could mitigate the severity.
14:37:38 <bmbouter> there isn't a reliable way
14:38:52 <mhrivnak> So options seem to be 1) make celery stop forking, 2) figure out how to delay database access during startup until post-fork, 3) wait for pulp 3?
14:39:18 <mhrivnak> not suggesting any of those are good or easy of course. :)
14:39:26 <bmbouter> yeah but they are clear options
14:39:39 <bmbouter> so I don't want to take this on the sprint, but I think we kind of need to
14:39:47 <bmbouter> at least to understand if pulp3 will be affected or not
14:40:10 <bmbouter> option (2) can't be done without rearchitecting our crash-recover scenarios
14:40:18 <mhrivnak> I suggest we accept it but not take it on this sprint.
14:40:34 <mhrivnak> I think we can get to the other side of the plugin api work and then look at this.
14:40:47 <bmbouter> that is also what I want, but consider this
14:41:02 <bmbouter> katello and sat are both enabling the MAX_TASKS_.... options
14:41:04 <bmbouter> option
14:41:19 <bmbouter> so they will experience rare deadlocking by doing that
14:41:41 <bmbouter> that is the only thing that makes me think we should do more (even though I want to focus on the plugin API for pulp3)
14:41:42 <daviddavis> ugg
14:42:36 <mhrivnak> Gotcha. If you want to pursue this on this sprint, that's fine with me.
14:42:55 <ttereshc> so it looks like it was enabled in 6.2.7
14:43:00 <mhrivnak> It seems like any solution is likely to have a long timeline, but it doesn't hurt to start quickly.
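[Editor's note: the deadlock arises because a MongoClient opened before celery's prefork pool forks owns background monitor threads (used for replica-set discovery) that do not survive fork(), leaving a recycled child waiting forever on threads that no longer exist. A minimal sketch of mhrivnak's option (2), delaying database access until post-fork via celery's worker_process_init signal; the broker URL, Mongo URL, and app name are illustrative, and Pulp's real startup, with the crash-recovery paths bmbouter mentions, is considerably more involved:

    from celery import Celery
    from celery.signals import worker_process_init
    from pymongo import MongoClient

    app = Celery("pulp_sketch", broker="amqp://localhost//")

    client = None  # deliberately not connected in the parent process

    @worker_process_init.connect
    def connect_post_fork(**kwargs):
        # Runs in each freshly forked pool child -- including the
        # replacement children spawned when PULP_MAX_TASKS_PER_CHILD
        # recycles a worker -- so pymongo's monitor threads are created
        # on the correct side of the fork.
        global client
        client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

Option (1), making celery stop forking, would instead mean running workers with a non-forking pool such as celery's solo pool.]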
14:43:14 <ttereshc> it's been a while and no reports so far
14:43:23 <bmbouter> whatever you all decide is fine w/ me
14:43:28 <ttereshc> (I mean maybe we still can wait till the next sprint)
14:43:31 <bmbouter> I wanted to just provide the scope of impact, etc
14:43:47 <bmbouter> also I can advise about how to fix (stop forking) but I don't plan to take as assigned
14:43:50 <bmbouter> either way
14:44:04 <ttereshc> !propose accept and add to sprint
14:44:04 <pulpbot> ttereshc: propose accept Propose accepting the current issue in its current state.
14:44:13 <ttereshc> !propose other accept and add to sprint
14:44:13 <ttereshc> #idea Proposed for #2979: accept and add to sprint
14:44:13 <pulpbot> ttereshc: Proposed for #2979: accept and add to sprint
14:44:34 <mhrivnak> +1
14:45:03 <ttereshc> !accept
14:45:03 <ttereshc> #agreed accept and add to sprint
14:45:03 <pulpbot> ttereshc: Current proposal accepted: accept and add to sprint
14:45:05 <ttereshc> #topic I can create importers/publishers for any repo while targeting a specific repo URL - http://pulp.plan.io/issues/2985
14:45:05 <pulpbot> ttereshc: 1 issues left to triage: 2985
14:45:06 <pulpbot> Issue #2985 [NEW] (unassigned) - Priority: Normal | Severity: Medium
14:45:07 <pulpbot> I can create importers/publishers for any repo while targeting a specific repo URL - http://pulp.plan.io/issues/2985
14:46:44 <mhrivnak> This one definitely needs fixing.
14:47:00 <ipanova> let's add it to the sprint?
14:47:05 <ttereshc> I guess this issue is valid for any nested endpoints
14:47:09 <dkliban> yeah
14:47:10 <mhrivnak> That works for me.
14:47:15 <ttereshc> not only importers
14:47:16 <daviddavis> +1
14:47:18 <dkliban> +1
14:47:22 <ipanova> +1
14:47:42 <daviddavis> accept, add to sprint, and comment about checking other nested urls
14:47:56 <ttereshc> !propose other accept, add to sprint, and comment about checking other nested url
14:47:56 <ttereshc> #idea Proposed for #2985: accept, add to sprint, and comment about checking other nested url
14:47:57 <pulpbot> ttereshc: Proposed for #2985: accept, add to sprint, and comment about checking other nested url
14:48:03 <ttereshc> !accept
14:48:03 <ttereshc> #agreed accept, add to sprint, and comment about checking other nested url
14:48:03 <pulpbot> ttereshc: Current proposal accepted: accept, add to sprint, and comment about checking other nested url
14:48:04 <pulpbot> ttereshc: No issues to triage.
14:48:08 <ttereshc> !end
14:48:08 <ttereshc> #endmeeting
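[Editor's note: a hypothetical sketch of the class of fix #2985 asks for -- none of these names (ImporterSerializer, the repo_id URL kwarg) come from Pulp's actual codebase. The general shape, in Django REST Framework terms: a create on a nested endpoint must verify that the resource named in the request body matches the parent named in the URL, and as ttereshc notes, the same check applies to publishers and every other nested endpoint, not only importers:

    from rest_framework import serializers

    class ImporterSerializer(serializers.Serializer):
        repository = serializers.CharField()

        def validate_repository(self, value):
            # For POST /repositories/<repo_id>/importers/, refuse a body
            # that points at a different repository than the URL does.
            url_repo = self.context["view"].kwargs.get("repo_id")
            if url_repo is not None and value != url_repo:
                raise serializers.ValidationError(
                    "repository in the request body must match the "
                    "repository in the URL"
                )
            return value
]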