Site Reliability Engineer (SRE)

Vacancy has expired

Job Reference:
CS/R000676_1495030462
Job Title:
Site Reliability Engineer (SRE)
31/05/2017
City:
Leeds
Company Name:
Salary Band:
Highly Competitive
Salary Details:
Excellent + Free SkyQ + loads more!
Job Level:
Manager / Mid-Level
Job Type:
Full time
Location / Region:
YORKSHIRE & HUMBER
Leeds
Closing Date:
14/06/2017

Ensure our customers get the best quality of service and uptime we can give them. Identify where we can expect and how we can tolerate failures from our systems as well as those we depend upon. Work closely with our developers and architects to build and run services and systems that respond consistently to failures by gracefully degrading our services.

Be responsible for ensuring the systems and applications we launch remain available, reliable and efficient at accomplishing their duties, even as those duties scale and evolve. Be involved in every part of our site, from the conception and development of products to deployment, troubleshooting and analysis.

Design, build and automate tools and processes to ensure and improve scalability, availability and performance across areas of technology. Build, integrate and run tools to inject, predict and identify infrastructure and service failures on an ongoing basis to help optimize our sites.

You will primarily use open source technologies and products in a LAMP environment, so you'll need extensive commercial experience in supporting and developing high-volume commercial websites using object-oriented PHP and MySQL.

Data will underpin your decisions and you will take care to ensure qualitative metrics are held in as high regard as quantitative.

Key responsibilities for our Site Reliability Engineers:

Optimize availability, stability and performance of services

  • Work with our developers and architects to design and integrate systems that respond consistently to failures by gracefully degrading our services.
  • Develop tools and procedures to manage demand on our systems when that demand is too high, e.g. degrading services gracefully, user prioritization, removing low priority traffic, intelligent banners.
  • Measure the capability of our infrastructure and applications to manage failures from failovers to full site outages. Make recommendations to the business on the levels of service that can be supported during different failure scenarios.
  • Execute regular testing and measurement of our infrastructure and platforms to identify improvements in their reliability e.g. DR, performance and security testing.
  • Design and run regular testing of applications in an off-duty state (e.g. located on a standby DR site, behind bannered services) to ensure they perform both functionally and from a performance standpoint.
  • Instigate planned and spontaneous "fire drills" to continually test our systems' ability to deal with failures and identify weak points that need improving.
  • Work with all other tribes to schedule and run the failover of our systems invoking DR and BCP processes as a business.

Refine and influence system design and implementation

  • Enable and support the growth and scaling of products and services. Identify inefficiencies in our current systems and plan for growth in both new and existing ones.
  • Be a key driver for operational excellence across the SDLC and work with our feature squads to ensure best practices around performance, deployment, monitoring and availability.
  • Apply data-driven analysis to drive engineering decisions.
  • Minimize the manual work our engineers carry by finding inefficiencies and automating them to avoid extra work in the future.

Build and run tools to identify, predict and mitigate failures

  • Design, build and implement tools to aid the fault finding and debugging of incidents that occur in the deployment and running of applications and systems.
  • Introduce and maintain tools that help measure the resilience of our applications and infrastructure to help them better tolerate failures.
  • Engineer chaos tools and procedures to inject failure into our systems to certify that they are fault tolerant and recoverable.
  • Monitor, analyse and predict service performance and capacity to proactively forecast problems. Apply engineering knowledge in developing or providing tools for anomaly detection and failure prediction.

Operational Support

  • Collaborate with our other engineering teams and lead the triage of high priority production incidents while bringing about changes to improve reliability.
  • Provide technical guidance for service upgrades, rollouts and enhancements.
  • Use tools and intuition to help support teams identify and mitigate potential problems and vulnerabilities.
  • Develop engineering solutions to failures and all other problems that adversely affect site reliability and uptime, including capacity, performance, stability and security issues.

Skills and Technologies

The role is multi-disciplinary and benefits from a varying level of understanding in the following areas:

  • We are a RHEL/CentOS house so a very good understanding of Linux is essential.
  • We have some typical LAMP stacks, though Mongo, Redis, Memcached and RabbitMQ also feature highly.
  • We write our code in PHP and JavaScript, making heavy use of Node.js. There's the usual mixture of bash, a little Python, and some Ruby. Our source control is Git.
  • We make heavy use of Chef for our configuration management; experience of this or another CM tool is necessary.
  • We have heavy integration with OpenBet systems underpinning our sportsbook and gaming services.
  • We make use of Graphite, Grafana, New Relic, Splunk and Opsview for monitoring our services.

Copyright © Bubble Jobs Ltd, 2011 - 2017, All Rights Reserved | Powered by JobMount Job Board Software