Site Reliability Engineer - Paid Relocation!

Vacancy has expired

Job Reference: CS/R001928_1510596146
Job Title: Site Reliability Engineer - Paid Relocation!
13/11/2017
City: City of London
Company Name:
Salary Band: Highly Competitive
Salary Details: Excellent + Free SkyQ + loads more!
Job Level: Manager / Mid-Level
Job Type: FULL_TIME
Location / Region: LONDON, Central London
Closing Date: 11/12/2017

At Sky Betting & Gaming we don't have Teams, we have Tribes. Each Tribe is made up of small, agile and autonomous squads who work collaboratively with a shared purpose.

So whether you join our Bet Tribe, Early Careers Tribe or anything in between, you will work with a group of people with raw energy, natural talent, and the kind of spirit that helps us think big, act bold, and change the game.

It all means that we've created the kind of workplace that wins awards such as a place on the prestigious list of Sunday Times 100 Best Companies to Work For.

Ensure our customers get the best quality of service and uptime we can give them. Identify where we can expect failures, and how we can tolerate them, both in our own systems and in those we depend upon. Work closely with our developers and architects to build and run services and systems that respond consistently to failures by gracefully degrading our services.

Be responsible for ensuring the systems and applications we launch remain available, reliable and efficient at accomplishing their duties even as those duties scale and evolve. Be involved in every part of our site, from the conception of products and their development through to deployment, troubleshooting and analysis.

Design, build and automate tools and processes to ensure and improve scalability, availability and performance across areas of technology. Build, integrate and run tools to inject, predict and identify infrastructure and service failures on an ongoing basis to help optimize our sites.

You will primarily use open source technologies and products in a LAMP environment, so you'll need extensive commercial experience in supporting and developing high-volume commercial web sites using object-oriented PHP and MySQL.

Data will underpin your decisions and you will take care to ensure qualitative metrics are held in as high regard as quantitative.

Optimize availability, stability and performance of services

  • Work with our developers and architects to design and integrate systems that respond consistently to failures by gracefully degrading our services.
  • Develop tools and procedures to manage demand on our systems when that demand is too high, e.g. degrading services gracefully, user prioritization, removing low priority traffic, intelligent banners.
  • Measure the capability of our infrastructure and applications to manage failures, from failovers to full site outages. Make recommendations to the business on the levels of service that can be supported during different failure scenarios.
  • Execute regular testing and measurement of our infrastructure and platforms to identify improvements in their reliability, e.g. DR, performance and security testing.
  • Design and run regular testing of applications in an off-duty state (e.g. located on a standby DR site, behind bannered services) to ensure they work correctly and perform well.
  • Instigate planned and spontaneous "fire drills" to continually test our systems' ability to deal with failures and identify weak points that need improving.

Refine and influence system design and implementation

  • Enable and support the growth and scaling of products and services. Identify inefficiencies in our current systems and plan for growth in both new and existing ones.
  • Be a key driver for operational excellence across the SDLC and work with our feature squads to ensure best practices around performance, deployment, monitoring and availability.
  • Apply data-driven analysis to drive engineering decisions.
  • Reduce the manual workload on our engineers by finding inefficiencies and automating them away.

Build and run tools to identify, predict and mitigate failures

  • Design, build and implement tools to aid the fault finding and debugging of incidents that occur in the deployment and running of applications and systems.
  • Introduce and maintain tools that help measure the resilience of our applications and infrastructure to help them better tolerate failures.
  • Engineer chaos tools and procedures to inject failure into our systems to certify that they are fault tolerant and recoverable.
  • Monitor, analyse and predict service performance and capacity to proactively forecast problems. Apply engineering knowledge in developing or providing tools for anomaly detection and failure prediction.

Operational Support

  • Collaborate with our other engineering teams and lead the triage of high priority production incidents while bringing about changes to improve reliability.
  • Provide technical guidance for service upgrades, rollouts and enhancements.
  • Utilise tools and intuition to aid support teams in identifying and mitigating potential problems and vulnerabilities.
  • Develop engineering solutions to failures and all other problems that adversely affect site reliability and uptime, including capacity, performance, stability and security issues.

The role is multi-disciplinary and benefits from a varying level of understanding of the following areas:

  • We are a RHEL/CentOS house so a very good understanding of Linux is essential.
  • We have some typical LAMP stacks, though Mongo, Redis, Memcached and RabbitMQ also feature highly.
  • We write our code in PHP and JavaScript, making heavy use of Node.js. There's the usual mixture of Bash, a little Python, and some Ruby. Our source control is Git.
  • We make heavy use of Chef for our configuration management; experience of this or another CM tool is necessary.
  • We have heavy integration with OpenBet systems underpinning our sportsbook and gaming services.
  • We make use of Graphite, Grafana, New Relic, Splunk and Opsview for monitoring our services.

We also offer an attractive relocation package for candidates who live outside the Yorkshire region, including those outside the UK.

Our People Ambition is to attract & develop diverse & talented people to meet the current and future growth needs of SB&G. Together, our aim is to create the Best Digital Business to work at.

Copyright © Bubble Jobs Ltd, 2011 - 2017, All Rights Reserved | Powered by JobMount Job Board Software