AWS Incident Management Specialist Job at JustinBradley, Reston, VA

  • JustinBradley
  • Reston, VA

Job Description

JustinBradley’s client, a leading source of mortgage financing, is seeking an AWS Incident Management Specialist to join their team and manage IT production incidents to resolution in a 24/7/365 environment using our client’s incident management processes. You will guide incident triage calls from a technical perspective, utilize monitoring tools and dashboards to aid troubleshooting, share technical insights, outline resolution activities, and drive improvements in incident management processes. You will also provide regular status updates to stakeholders, assist with postmortem activities, and support efforts related to operational enhancements and application maintenance in production.

Key Responsibilities:

  • Incident Management: Lead and manage IT production incidents to resolution using incident management processes. Communicate the incident status, impact, and resolution actions effectively to stakeholders. Participate in triage calls and manage incident response in a timely and accurate manner.
  • AWS Expertise: Apply hands-on experience managing and monitoring AWS-based applications. Troubleshoot and resolve incidents related to AWS cloud infrastructure (EC2, ELB, RDS, Redshift, DynamoDB, Aurora, Route53, ECS, Lambda, S3, CloudWatch, WAF, etc.) in real time.
  • Performance Engineering: Conduct performance engineering for AWS cloud applications. Use tools like Dynatrace and Splunk for transaction-level monitoring and troubleshooting. Leverage AWS tools and resources to analyze and resolve incidents promptly.
  • Monitoring Tools Management: Manage and monitor AWS cloud applications and underlying infrastructure using monitoring tools like ExtraHop, SolarWinds, Netcool, Catchpoint, Moogsoft, and others. Analyze dashboards and monitoring data to identify trends and patterns in application performance and health.
  • Incident Triage & Resolution: Lead and guide technical incident triage calls, analyze various components of the infrastructure (AWS, UNIX, DNS, LDAP, SSL, etc.), and perform detailed root-cause analysis using wire data analytics, event correlation, and performance management tools.
  • Documentation & Postmortems: Assist with the creation of Root Cause Analysis (RCA) and Correction of Errors (COE) documentation. Participate in postmortem activities and recommend improvements to prevent future incidents. Ensure effective follow-up on items that could negatively impact production operations.
  • Process Improvement: Recommend and implement improvements to incident management processes. Provide recommendations on process changes, create reports, and respond to ad-hoc requests from senior management.
  • On-Call Support: Participate in an on-call rotation, working nights, weekends, and holidays as required to provide continuous support for incident management and resolution.
  • Stakeholder Communication: Report incident details and metrics to senior leadership. Effectively communicate complex technical issues to non-technical stakeholders.

Education & Experience:

  • Education: Bachelor’s Degree or equivalent required.
  • Experience: Minimum of 6 years of relevant experience managing IT incidents and troubleshooting in a cloud environment, particularly AWS.

Specialized Knowledge & Skills:

  • Extensive experience managing AWS cloud environments, including services like EC2, RDS, Lambda, DynamoDB, CloudWatch, and more.
  • Hands-on experience troubleshooting infrastructure and application incidents on AWS.
  • Experience with transaction-level monitoring using tools like Dynatrace and Splunk.
  • Expertise in analyzing various components of the application and infrastructure, including AWS, UNIX, LDAP, DNS, SSL, and databases (Oracle/MS SQL).
  • Proven ability to manage complex incidents and lead triage calls with cross-functional technical teams.
  • Strong communication skills, including the ability to convey technical details to non-technical stakeholders.
  • Ability to multi-task and perform well under pressure in high-stress situations.
  • Familiarity with monitoring and observability tools such as SolarWinds, ExtraHop, Moogsoft, and Catchpoint.
  • AWS certifications (e.g., AWS Certified Solutions Architect – Associate or higher) preferred.

Preferred Qualifications:

  • Familiarity with tools like CloudFormation or Terraform.
  • Experience troubleshooting middleware products in UNIX/Linux environments and knowledge of Service-Oriented Architecture (SOA), Java, etc.
  • Exposure to other cloud platforms like Azure or Google Cloud.
  • Experience with OpenTelemetry and monitoring dashboards for incident detection and alerting.

Work Environment:

  • 24/7/365 operational support environment.
  • Ability to work various shifts, including nights, weekends, and holidays, as required.

JustinBradley is an EO employer - Veterans/Disabled and other protected employees.

Job Tags

Holiday work, Shift work, Night shift
