Job Description
The TCS Cloud Unit is looking for an experienced Observability and AI Solution Architect to design and implement enterprise-grade observability and AI solutions that provide deep visibility into infrastructure, applications, and networks and that transform IT operations.
This role requires expertise in leading observability platforms and hands-on experience in IT operations, combined with the ability to integrate AI-driven solutions for IT Operations (AIOps) using cutting-edge technologies such as LLMs, agentic frameworks, and industry-leading platforms like Anthropic, OpenAI, Bedrock, Gemini, and others. It also requires strong customer-facing capabilities, including architecture defense, RFP solutioning, and leadership of onshore and offshore solution and delivery teams.
Responsibilities Include
Provide observability strategies for infrastructure (servers, storage, cloud), applications (microservices, APIs), and networks (LAN/WAN, SD-WAN). Collaborate with DevOps, SRE, and IT operations teams to ensure end-to-end visibility and reliability.
Design, architect, and deliver end-to-end AIOps and observability solutions, covering telemetry collection, ingestion, correlation, analytics, dashboards, and operational workflows.
Design and architect AIOps solutions using industry-leading platforms like OpenAI, AWS Bedrock, Google Gemini, Anthropic, and similar technologies. Develop predictive analytics and anomaly detection models to proactively identify and resolve operational issues.
Guide onshore and offshore solution teams through requirements understanding, solution creation, estimation, and operating model definition; support RFP solutioning with architecture, roadmaps, and estimates.
Design and recommend integrations between monitoring/observability platforms and ITSM tools using APIs and service interfaces.
Integrate observability tools with ITSM platforms and automation workflows. Enable automated root cause analysis and remediation using AI/ML models. Define self-healing and runbook automation.
Collaborate with business and IT teams to identify key metrics and integrate them into dashboards and alerting systems.
Establish observability standards, KPIs, and SLAs for performance and availability. Ensure compliance with security and regulatory requirements in monitoring solutions.
Qualifications
10+ years of experience in IT operations, managed services, infrastructure, cloud, and application support or transformation roles with significant architecture responsibility.
Strong hands-on experience implementing AIOps and observability solutions across multiple observability platforms (e.g., Grafana, Datadog, Splunk, Dynatrace, ScienceLogic, SolarWinds).
Strong experience integrating monitoring, event management, and automation into ITSM platforms, and in dashboard development.
Strong experience in AI-driven automation and orchestration, including solutioning with scripting techniques.
Experience in solution design using GenAI and agentic AI use cases in IT operations (for example, automated triage, knowledge generation, runbook generation, and assisted AI).
Strong communication, offshore management, and stakeholder management skills, including the ability to present, influence, and defend technical solutions with customers and executive audiences.
Salary Range:
$134,400-$181,900 a year