AWS Incident Management Specialist Job at JustinBradley, Reston, VA

  • JustinBradley
  • Reston, VA

Job Description

JustinBradley’s client, a leading source of mortgage financing, is seeking an AWS Incident Management Specialist to join their team and manage IT production incidents to resolution in a 24/7/365 environment using our client’s incident management processes. You will guide incident triage calls from a technical perspective, utilize monitoring tools and dashboards to aid troubleshooting, share technical insights, outline resolution activities, and drive improvements in incident management processes. You will also provide regular status updates to stakeholders, assist with postmortem activities, and support efforts related to operational enhancements and application maintenance in production.

Key Responsibilities:

  • Incident Management: Lead and manage IT production incidents to resolution using incident management processes. Communicate the incident status, impact, and resolution actions effectively to stakeholders. Participate in triage calls and manage incident response in a timely and accurate manner.
  • AWS Expertise: Utilize hands-on experience managing and monitoring AWS-based applications. Troubleshoot and resolve incidents related to AWS cloud infrastructure (EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route53, ECS, Lambda, S3, CloudWatch, WAF, etc.) in real time.
  • Performance Engineering: Conduct performance engineering for AWS cloud applications. Utilize tools like Dynatrace and Splunk for transaction-level monitoring and troubleshooting. Leverage AWS tools and resources to analyze and resolve incidents promptly.
  • Monitoring Tools Management: Manage and monitor AWS cloud applications and underlying infrastructure using monitoring tools like ExtraHop, SolarWinds, Netcool, Catchpoint, Moogsoft, and others. Analyze dashboards and monitoring data to identify trends and patterns in application performance and health.
  • Incident Triage & Resolution: Lead and guide technical incident triage calls, analyze various components of the infrastructure (AWS, UNIX, DNS, LDAP, SSL, etc.), and perform detailed root-cause analysis using wire data analytics, event correlation, and performance management tools.
  • Documentation & Postmortems: Assist with the creation of Root Cause Analysis (RCA) and Correction of Errors (COE) documentation. Participate in postmortem activities and recommend improvements to prevent future incidents. Ensure effective follow-up on items that could negatively impact production operations.
  • Process Improvement: Recommend and implement improvements to incident management processes. Provide recommendations on process changes, create reports, and respond to ad-hoc requests from senior management.
  • On-Call Support: Participate in an on-call rotation, working nights, weekends, and holidays as required to provide continuous support for incident management and resolution.
  • Stakeholder Communication: Report incident details and metrics to senior leadership. Effectively communicate complex technical issues to non-technical stakeholders.

Education & Experience:

  • Education: Bachelor’s Degree or equivalent required.
  • Experience: Minimum of 6 years of relevant experience managing IT incidents and troubleshooting in a cloud environment, particularly AWS.

Specialized Knowledge & Skills:

  • Extensive experience managing AWS cloud environments, including services like EC2, RDS, Lambda, DynamoDB, CloudWatch, and more.
  • Hands-on experience troubleshooting infrastructure and application incidents on AWS.
  • Experience with transaction-level monitoring using tools like Dynatrace and Splunk.
  • Expertise in analyzing various components of the application and infrastructure, including AWS, UNIX, LDAP, DNS, SSL, and databases (Oracle/MS SQL).
  • Proven ability to manage complex incidents and lead triage calls with cross-functional technical teams.
  • Strong communication skills, including the ability to convey technical details to non-technical stakeholders.
  • Ability to multi-task and perform well under pressure in high-stress situations.
  • Familiarity with monitoring and observability tools such as SolarWinds, ExtraHop, Moogsoft, and Catchpoint.
  • AWS certifications (e.g., AWS Certified Solutions Architect – Associate or higher) preferred.

Preferred Qualifications:

  • Familiarity with tools like CloudFormation or Terraform.
  • Experience troubleshooting Middleware products in UNIX/Linux environments and knowledge of Service Oriented Architecture (SOA), Java, etc.
  • Exposure to other cloud platforms like Azure or Google Cloud.
  • Experience with OpenTelemetry and monitoring dashboards for incident detection and alerting.

Work Environment:

  • 24/7/365 operational support environment.
  • Ability to work various shifts, including nights, weekends, and holidays, as required.

JustinBradley is an EO employer - Veterans/Disabled and other protected employees.

Job Tags

Holiday work, Shift work, Night shift
