Job Description
Key Responsibilities:
- Design, build, and maintain data pipelines across on-prem Hadoop and AWS
- Develop and maintain Java applications, utilities, and data processing libraries
- Manage and enhance internal Java libraries used for ingestion, validation, and transformation
- Migrate and sync data from on-prem HDFS to AWS S3
- Develop and maintain Airflow DAGs for orchestration and scheduling
- Work with Kafka-based streaming pipelines for real-time/near-real-time ingestion
- Build and optimize Spark / PySpark jobs for large-scale data processing
- Use Hive, Presto/Trino, and Athena for querying and validation
- Implement data quality checks, monitoring, and alerting
- Support Apache Iceberg tables and AWS external tables
- Troubleshoot production issues and ensure SLA compliance
- Collaborate with platform, analytics, and observability teams
Technical Skills Required:
- Java (development, maintenance, and build tools such as Gradle)
- AWS (S3, Glue, EMR, Athena, EKS basics)
- Hadoop/HDFS, Hive
- Apache Kafka (producers/consumers, topics, streaming ingestion)
- Apache Spark / PySpark (batch and streaming processing)
- Apache Airflow (DAG development and maintenance)
- Python
- Git and CI/CD workflows
- Observability tools (Prometheus/Grafana)
- SQL
Location: Austin, TX or Sunnyvale, CA
Salary Range: $70,000-$135,000 a year