Senior Site Reliability Engineer (SRE)

Salla, Makkah, 21/1/2026

Full-time, 5+ years of experience

Job Description

As a Senior SRE at Salla, you will lead reliability initiatives, handle complex incidents, improve platform performance, and guide engineering teams toward building resilient systems. You will also participate in the on-call rotation as part of our commitment to platform reliability.

Reliability & Incident Management

- Lead high-severity incident response and drive post-incident reviews.
- Troubleshoot complex issues across applications, infrastructure, and networks.
- Improve MTTR through better monitoring, alerts, and diagnostic tooling.
- Participate in the on-call rotation supporting production systems.

Performance & Scalability

- Identify and resolve performance bottlenecks and scaling challenges.
- Conduct load testing and capacity planning for high-traffic scenarios.

Infrastructure & Operations

- Enhance cloud-native infrastructure, deployment processes, and automation.
- Improve resilience, fault-tolerance, and recovery mechanisms across systems.

Observability

- Build and refine dashboards, alerts, metrics, logs, and traces.
- Define SLIs/SLOs and improve visibility into system behavior.

Tooling & Automation

- Develop tools that reduce operational toil and increase reliability.
- Contribute to infrastructure-as-code, CI/CD pipelines, and GitOps workflows.

Collaboration

- Work closely with engineering teams to ensure services are robust and production-ready.
- Mentor engineers on reliability, debugging, and operational best practices.

Bonus Skills

- Background in large-scale, high-traffic systems.
- Experience with fault-tolerant design, DR, and HA patterns.
- Familiarity with SLOs, SLIs, and error budgets.

Location Preference

- Candidates located within GMT 0 to +6 time zones are preferred to align with team collaboration and on-call coverage.

Core Requirements

- Strong experience with Kubernetes, service mesh technologies, and cloud platforms (AWS, GCP, or Azure).
- Deep understanding of Linux, networking, distributed systems, and load balancing.
- Hands-on experience with Terraform or similar Infrastructure-as-Code tools.
- Experience with observability platforms such as Prometheus, Grafana, Loki, Mimir, Elastic, or equivalent.
- Proficiency in scripting or programming languages such as Bash, Python, or Go.
- Experience with CI/CD pipelines and GitOps practices.
- Strong debugging, incident response, and performance analysis skills.

Required Skills

Kubernetes, Terraform, Prometheus, Grafana, AWS, Python, Bash, Go, GitOps, Linux