Registration: 01.07.2024

Skills

Python
SQL
PL/SQL
Scala
UNIX
Hadoop
HDFS
MapReduce
Spark
Airflow
Nifi
HBase
Hive
Pig
Sqoop
Kafka
Oozie
RAD
JAD
SDLC
Agile
AWS
Microsoft Azure
GCP
Tableau
PowerBI
Atscale
Oracle
Teradata
Erwin Data Modeler
ER Studio v17

Work experience

Senior Big Data Engineer / Snowflake Data Engineer
04.2022 - present |Wells Fargo
Hadoop, Spark, Hive, Python, PL/SQL, AWS, EC2, EMR, S3, Lambda, Auto Scaling, MapReduce, Databricks, Data Lake, Oracle 12c, Flat Files, Snowflake, MS SQL Server, XML, Kafka, Airflow, UNIX, Erwin
● Developed Spark, Scala, and Python code for a regular-expression (regex) project in the Hadoop/Hive environment, on Linux/Windows, for big data resources.
● Developed and designed JavaScript modules and REST APIs utilizing MarkLogic to support complex searches, enhancing enterprise platform integrations.
● Implemented ETL pipelines using Databricks to transform and process large datasets, ensuring efficient data integration and quality.
● Maintained database objects within a software version control system, ensuring version integrity and smooth deployments.
● Migrated data from existing applications and database environments to the new architecture, ensuring data integrity and consistency.
● Developed and maintained Databricks jobs to automate data processing workflows, enhancing data accuracy and timeliness.
● Conducted training sessions and provided mentorship on Data Vault modeling and automation to junior engineers, fostering a culture of continuous learning and improvement.
● Architected and built Informatica PowerCenter mappings and workflows, participating in requirements definition, system architecture, and data architecture design.
● Utilized the MarkLogic framework and DynamoDB for real-time data processing, combined with advanced analytics tools such as SAS, R, Python, and other statistical software.
● Implemented data management solutions using Snowflake, covering the complete lifecycle from architecture design to deployment, drawing on over 5 years of experience.
● Handled different file formats such as JSON, XML, and CSV, transforming and ingesting data across multiple platforms to support business intelligence and analytics initiatives.
● Communicated business goals and requirements to create data-driven solutions, enhancing business value.
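A regex-extraction step like the one described above can be sketched in plain Python. The pattern and field names here are hypothetical (the resume does not give the project's actual rules); in the Spark job, a function like this would typically be registered as a UDF and applied to a DataFrame column.

```python
import re

# Hypothetical pattern -- the real project's extraction rules are not given.
RECORD_RE = re.compile(r"(?P<account>\d{10})\s+(?P<amount>\d+\.\d{2})")

def extract_fields(line):
    """Pull an account number and amount out of a raw text line.

    Returns a dict of named groups, or None when the line does not match --
    the same contract a Spark UDF wrapping this function would expose.
    """
    m = RECORD_RE.search(line)
    return m.groupdict() if m else None
```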
● Proactively monitored AWS data storage and processing environments, implementing best practices for data security and compliance, and conducted regular performance tuning to ensure optimal operation of data-intensive applications.
● Developed and optimized ABAP Core Data Services (CDS) views to improve query performance and data extraction processes, enabling efficient real-time data access for downstream analytics applications on AWS.
● Designed and implemented robust AWS data pipelines, specializing in extracting and loading data from SAP tables into AWS S3, followed by transformation using AWS Glue, to support scalable data analytics solutions.
● Implemented Delta Live Tables for real-time data processing and created comprehensive data catalogs using Unity Catalog, enhancing data governance and security, ensuring compliance with industry standards, and improving data accessibility and transparency across the organization.
● Developed AWS Lambda functions using Node.js and TypeScript to handle various data processing tasks, ensuring efficient and scalable serverless solutions in production environments.
● Managed production issue triage and implemented preventative measures, providing technical support for release management and maintaining a healthy backlog.
● Developed and optimized real-time data pipelines using Apache Flink to ensure efficient and scalable data processing.
● Conducted research and development of AWS proofs of concept (POCs), leveraging AWS services to explore innovative solutions and improve existing processes.
● Collaborated with the application development, project, testing/QA, and architecture teams and IT management to support fixes, enhancements, and project implementations.
● Created and managed metadata repositories in Databricks, facilitating efficient data cataloging and governance.
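The Lambda functions above were written in Node.js/TypeScript; an equivalent handler can be sketched in Python for illustration. The event shape is the standard S3 put notification; the function body is an assumption, not the actual production code.

```python
import json

def handler(event, context):
    """Hypothetical AWS Lambda entry point for an S3 put notification.

    Extracts the object keys from the event so a downstream step can
    fetch and process the newly arrived files.
    """
    keys = [rec["s3"]["object"]["key"] for rec in event.get("Records", [])]
    return {"statusCode": 200, "body": json.dumps({"keys": keys})}
```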
● Developed web applications using Angular, JavaScript, and Node.js, employing JSON and XML for data modeling and Git/GitHub for modern source code management.
● Implemented real-time data processing solutions with Node.js, ensuring timely and accurate data availability for business-critical applications.
● Developed and optimized data pipelines using SAP tables and ABAP CDS views, improving data retrieval efficiency and supporting real-time analytics.
● Developed and maintained data pipelines that integrated Schema Registry for schema evolution and version control.
● Developed an ETL process to ingest large-scale data from multiple sources into a MySQL database.
● Developed automated regression scripts in Python to validate ETL processes across multiple databases, such as AWS Redshift, Oracle, and SQL Server (T-SQL).
● Skilled in SAS macros for automating repetitive tasks and enhancing efficiency.
● Used Airflow for scheduling Hive, Spark, and MapReduce jobs.
● Extensively utilized Databricks notebooks for interactive analysis using Spark APIs.
● Extensive experience in designing and implementing data warehouses using SAS.
● Strong expertise in designing, building, and maintaining data pipelines using Amazon Redshift as the data warehousing solution.
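A cross-database ETL validation script like the one above can be reduced to comparing result sets fetched from the two sides. The fingerprint approach below is an illustrative assumption, not the project's actual logic; in practice the rows would come from e.g. Redshift and SQL Server cursors.

```python
import hashlib

def table_fingerprint(rows):
    """Order-insensitive fingerprint of a result set: hash each row's repr
    and XOR the digests, so row order between databases does not matter."""
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(tuple(row)).encode()).digest()
        acc ^= int.from_bytes(digest, "big")
    return acc

def tables_match(source_rows, target_rows):
    """Pass condition: equal row counts and equal fingerprints."""
    source_rows, target_rows = list(source_rows), list(target_rows)
    return (len(source_rows) == len(target_rows)
            and table_fingerprint(source_rows) == table_fingerprint(target_rows))
```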
Big Data Engineer
07.2020 - 03.2022 |Nationwide
Hadoop, Kafka, Spark, Sqoop, Docker, Azure, Azure HDInsight, Spark SQL, TDD, Spark Streaming, Hive, Scala, Pig, Azure Databricks, Azure Data Storage, Azure Data Lake, Azure SQL, NoSQL, Impala, Oozie, HBase, Data Lake, Zookeeper
● Designed the business requirement collection approach based on the project scope and SDLC methodology.
● Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data between sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back in the reverse direction.
● Designed and implemented data pipelines and real-time data processing applications using Pub/Sub systems.
● Worked with standard Azure security tooling such as Microsoft Defender Suite and Sentinel, coupled with scripting languages such as PowerShell, Python, and Bash for automation and security orchestration.
● Worked on data that was a combination of unstructured and structured data from multiple sources, and automated the cleaning using Python scripts.
● Implemented CDC mechanisms with Apache Hudi to enable efficient upserts and data consistency in distributed data lake environments, ensuring real-time data availability for downstream analytics.
● Wrote Databricks code and fully parameterized ADF pipelines for efficient code management.
● Designed an end-to-end scalable architecture to solve business problems using Azure components such as HDInsight, Data Factory, Data Lake, Storage, and Machine Learning Studio.
● Involved in all steps and the scope of the project's reference data approach to MDM; created a data dictionary and source-to-target mappings in the MDM data model.
● Managed Azure Data Lake Storage (ADLS) and Data Lake Analytics and integrated them with other Azure services.
● Designed and implemented scalable data storage architectures using HBase and Cassandra for a real-time analytics platform, improving data ingestion rates by 30% and reducing query response times by 40% through effective schema design and performance tuning.
● In-depth knowledge of SAS Enterprise Guide for data analysis and reporting.
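Python cleaning of mixed structured and semi-structured records, as mentioned above, can be sketched like this. The key-normalization rules (snake_case keys, trimmed strings) are illustrative assumptions.

```python
import json
import re

def normalize(record):
    """Accept either a JSON string (semi-structured input) or a dict
    (already-structured input) and emit a flat dict with snake_case keys
    and whitespace-trimmed string values."""
    if isinstance(record, str):
        record = json.loads(record)
    cleaned = {}
    for key, value in record.items():
        # Collapse any non-alphanumeric run into "_", then lowercase.
        snake = re.sub(r"[^0-9a-zA-Z]+", "_", key).strip("_").lower()
        cleaned[snake] = value.strip() if isinstance(value, str) else value
    return cleaned
```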
● Worked with cross-functional teams to troubleshoot and resolve production issues related to data processing, storage, and delivery, ensuring timely resolution.
● Built and managed data infrastructure using Scala and Hadoop, enabling the business to efficiently store and analyze large amounts of data.
● Set up and maintained multi-node HBase and Cassandra clusters, ensuring high availability and fault tolerance; implemented monitoring and maintenance strategies, reducing downtime by 25% and achieving 99.9% uptime.
● Transformed business problems into big data solutions and defined the big data strategy and roadmap.
● Installed, configured, and maintained data pipelines.
● Hands-on experience working with Databricks and Delta Lake tables.
● Worked extensively on performance tuning of Spark jobs.
● Developed Databricks Python notebooks to join, filter, pre-aggregate, and process files stored in Azure Data Lake Storage.
● Built a data warehouse by integrating multiple MySQL databases into a single data store.
● Used Kubernetes to ensure high availability and scalability for big data applications, allowing efficient and uninterrupted processing of large volumes of data.
● Designed, implemented, and maintained an Autosys job scheduling system for big data pipelines processing millions of records per day.
● Implemented automated workflows using Apache NiFi and Ansible playbooks to ingest and process large volumes of data, improving the efficiency of the data processing pipeline; built a new CI pipeline.
● Designed and developed data pipelines using Apache Airflow to orchestrate complex ETL processes, improving data quality and reducing processing time by 50%; automated testing and deployment with Docker, Swarm, Jenkins, and Puppet, and utilized continuous integration and automated deployments with Jenkins, Kubernetes, and Docker.
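The dependency-ordering problem Airflow solves for a DAG of ETL tasks can be modeled in a few lines of plain Python (the task names below are hypothetical examples, not the project's actual DAG):

```python
def topo_order(dag):
    """Return a valid run order for tasks given a {task: [upstream, ...]}
    mapping -- the core scheduling decision behind an Airflow DAG run."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return
        seen.add(task)
        for upstream in dag.get(task, []):
            visit(upstream)  # run dependencies first
        order.append(task)

    for task in dag:
        visit(task)
    return order
```

For a DAG such as extract → {load, transform} → publish, this yields an order in which every task runs after all of its upstream dependencies.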
● Designed and developed the architecture for a data services ecosystem spanning relational, NoSQL, and big data technologies.
● Architected a scalable CDC framework using Apache Hudi and Kafka Streams to capture and process data changes from various sources, enhancing the accuracy and timeliness of data synchronization across the enterprise data platform.
● Developed JSON scripts for deploying Azure Data Factory (ADF) pipelines that process data using the SQL activity; built an ETL job that executes the business analytical model inside a Spark JAR.
● Built data integration that ingests, transforms, and integrates structured data and delivers it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
● Expertise in data integration, data transformation, and data quality using SAS Data Integration Studio.
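The upsert semantics a Hudi-based CDC framework provides can be modeled in plain Python. The field names `id` and `ts` are stand-ins for Hudi's record key and precombine field; this is a sketch of the behavior, not Hudi's implementation.

```python
def upsert(table, changes, key="id", precombine="ts"):
    """Merge CDC change records into a table, keeping per key the record
    with the newest precombine value -- the behavior Hudi's upsert
    operation gives a copy-on-write table."""
    merged = {row[key]: row for row in table}
    for row in changes:
        current = merged.get(row[key])
        if current is None or row[precombine] >= current[precombine]:
            merged[row[key]] = row  # newer change wins; stale ones are dropped
    return sorted(merged.values(), key=lambda r: r[key])
```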
Data Engineer
01.2018 - 06.2020 |Johnson & Johnson
Hadoop, Cloudera, HBase, HDFS, MapReduce, AWS, Atscale, YARN, Hive, Pig, Sqoop, Oozie, Tableau, Java
● Involved in transforming data from mainframe tables to HDFS and HBase tables using Sqoop.
● Implemented DataStage jobs to process and analyze large volumes of mortgage data, enabling data-driven insights into loan performance, borrower behavior, and market trends.
● Visualized results using dashboards, with the Python Seaborn library used for data interpretation in deployment; used REST APIs to access HBase data for analytics.
● Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.
● Handled administration activities using Cloudera Manager.
● Created and maintained technical documentation for launching Hadoop clusters and for executing Pig scripts.
● Stored and processed large amounts of data using services such as Google Cloud Storage, Bigtable, and BigQuery, which store raw data, process it, and load it into a data warehouse for analysis.
● Built a program using Python and Apache Beam to execute in Cloud Dataflow and run data validation jobs between raw source files and BigQuery tables.
● Automatically scaled up EMR instances based on the data volume using AtScale.
● Developed the company's internal CI system, providing a comprehensive API for CI/CD.
● Developed MapReduce programs to process Avro files and produce results by performing calculations on the data, including map-side joins.
● Imported bulk data into HBase using MapReduce programs.
● Stored streaming data to HDFS and implemented Spark for faster processing of data.
● Created RDDs and DataFrames for the required input data and performed the data transformations using Spark with Python.
● Migrated complex MapReduce programs into in-memory Spark processing using transformations and actions.
● Involved in migrating tables from an RDBMS into Hive tables using Sqoop, and later generated visualizations using Tableau.
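A data-validation job like the Beam/Dataflow one above boils down to comparing per-key record counts between the raw source and the loaded table. That core check can be sketched in plain Python (the key function in the usage is a hypothetical example; in the real job the two sides would be a source file and a BigQuery table):

```python
from collections import Counter

def count_mismatches(source_records, table_records, key_fn):
    """Compare per-key record counts between a raw source and a warehouse
    table; returns {key: (source_count, table_count)} for every key where
    the two sides disagree. An empty dict means validation passed."""
    source = Counter(key_fn(r) for r in source_records)
    table = Counter(key_fn(r) for r in table_records)
    return {k: (source.get(k, 0), table.get(k, 0))
            for k in source.keys() | table.keys()
            if source.get(k, 0) != table.get(k, 0)}
```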
● Deep understanding of moving data into GCP using the Sqoop process, using custom hooks for MySQL and Cloud Data Fusion for moving data from Teradata to GCS.
● Stored the time-series transformed data from the Spark engine, built on top of a Hive platform, to Amazon S3 and Redshift.
● Facilitated deployment of a multi-clustered environment using AWS EC2 and EMR, in addition to deploying Docker containers for cross-functional deployment.
● Designed and implemented incremental imports into Hive tables and wrote Hive queries to run on Tez.
● Created key mappings with Talend Integration Suite to pull data from sources, apply transformations, and load data into the target database.
● Developed Spark SQL queries and DataFrames to import data from data sources, perform transformations, perform read/write operations, and save the results to an output directory in HDFS.
● Worked with large datasets and performed advanced data manipulation, aggregation, and transformation using ANSI SQL.
● Proficient in working with large datasets, including data modeling, database schema design, and data partitioning, using Redshift and Snowflake.
● Implemented workflows using the Apache Oozie framework to automate tasks.
● Designed and implemented incremental imports into Hive tables.
● Worked with NoSQL databases such as HBase, making HBase tables to load large sets of semi-structured data.
● Involved in collecting, aggregating, and moving data from servers to HDFS using Flume.
● Imported and exported data between relational data sources such as DB2, SQL Server, and Teradata and HDFS using Sqoop.
● Involved in data ingestion into HDFS using Sqoop for full loads and Flume for incremental loads from a variety of sources such as web servers, RDBMSs, and data APIs.
● Collected data using Spark Streaming from an AWS S3 bucket in near-real-time, performed the necessary transformations and aggregations to build the data model, and persisted the data in HDFS.
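An incremental import like the ones above keeps a watermark and picks up only newer rows on each run. That logic can be sketched in plain Python; `updated_at` is a hypothetical timestamp column, mirroring the behavior of Sqoop's `--incremental lastmodified` mode.

```python
def incremental_batch(rows, last_watermark, ts_field="updated_at"):
    """Select only rows newer than the stored watermark and return the
    batch together with the new watermark to persist for the next run."""
    batch = [r for r in rows if r[ts_field] > last_watermark]
    new_watermark = max((r[ts_field] for r in batch), default=last_watermark)
    return batch, new_watermark
```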
Hadoop Developer
05.2014 - 09.2017 |NMG Technologies
Hadoop, CDH, MapReduce, Pig, MS SQL Server, SQL Server Business Intelligence Development Studio, Hive, HBase, SSIS, Office, Excel, Flat Files, T-SQL
● Built a custom file system plugin that allows Hadoop MapReduce programs, HBase, Pig, and Hive to work unmodified and access files directly.
● Made extensive use of expressions, variables, and row counts in SSIS packages.
● Wrote jobs in Java for data cleaning and pre-processing.
● Wrote MapReduce jobs using Pig Latin; involved in ETL, data integration, and migration.
● Created Hive tables and worked on them using HiveQL; experienced in defining job flows.
● Imported and exported data between HDFS and an Oracle database using Sqoop.
● Created batch jobs and configuration files to build automated processes using SSIS.
● Created SSIS packages to pull data from SQL Server and export it to Excel spreadsheets, and vice versa.
● Deployed and scheduled reports using SSRS to generate daily, weekly, monthly, and quarterly reports.
● Involved in creating Hive tables, loading the data, and writing Hive queries that run internally as MapReduce jobs.
● Ran Hadoop clusters on GCP using the Cloud Dataproc service, which allows easy creation and management of Hadoop and Spark clusters.
● Stored data on GCP using services such as Cloud Storage and Bigtable, which can serve as input for Hadoop jobs.
● Made extensive use of the Cloud Shell SDK in GCP to configure and deploy services such as Dataproc, Cloud Storage, and BigQuery.
● Designed and managed the deployment of multiple Hadoop clusters using Ansible playbooks, reducing the time and effort required for manual installation and configuration.
● Developed a custom file system plugin for Hadoop so it can access files on the data platform.
● Designed and implemented a MapReduce-based large-scale parallel relation-learning system.
● Set up and benchmarked Hadoop/HBase clusters for internal use.
● Performed data validation and cleansing of staged input records before loading into the data warehouse.
● Automated the process of extracting various files, such as flat and Excel files, from sources such as FTP and SFTP (Secure FTP).
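The map/shuffle/reduce flow those MapReduce and Pig Latin jobs follow can be modeled in-process in a few lines of Python; a word count stands in here for the real cleaning and join jobs.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-process model of the MapReduce flow: apply the mapper to
    every record, shuffle the emitted (key, value) pairs by key, then
    apply the reducer to each key's list of values."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Word count, the canonical example:
counts = map_reduce(
    ["big data big", "data"],
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda key, values: sum(values),
)
# counts == {"big": 2, "data": 2}
```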
● Migrated and converted ETL jobs from Informatica and Talend into Google Cloud Dataflow and Dataproc, integrating them into CI/CD pipelines to streamline data processing workflows.
● Designed and delivered data integration and extraction solutions utilizing IBM DataStage and other ETL tools, optimized for performance on data warehouse platforms such as Teradata and BigQuery.
● Demonstrated strong expertise in Google Cloud Platform (GCP) tools, particularly Dataflow, BigQuery, and Cloud Composer, to design and implement scalable data processing solutions.
● Leveraged extensive GCP Dataflow, Java/Python, and Apache Beam skills to lead delivery streams, ensuring efficient and reliable data processing pipelines.
● Developed and implemented data warehouse and data lake solutions using BigQuery and other GCP services, ensuring efficient data storage and retrieval processes.
● Worked extensively with streaming and messaging systems such as Kafka, Pulsar, GCP Pub/Sub, and RabbitMQ; developed and maintained real-time data pipelines, ensuring low-latency and high-throughput data processing.
● Integrated streaming systems with various data storage solutions, including Cassandra; implemented connectors and custom integrations to enable seamless data flow between messaging systems and databases.
● Used GCP services including Cloud Storage and Dataproc for efficient data storage, processing, and analytics, leveraging GCP's scalable infrastructure to support large-scale data engineering projects.
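The at-least-once delivery pattern those messaging pipelines rely on (process first, commit the offset second) can be sketched without any broker client; the commit model below mirrors Kafka consumer offsets and is an illustration, not a real consumer.

```python
def consume(messages, process, committed_offset=-1):
    """At-least-once consumption loop: process each message, then advance
    the committed offset. If `process` raises, the offset stays put and
    the message is redelivered on the next run -- possibly processing it
    twice, which is why downstream handlers should be idempotent."""
    for offset, message in enumerate(messages):
        if offset <= committed_offset:
            continue  # already processed in a previous run
        process(message)
        committed_offset = offset
    return committed_offset
```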

Educational background

Computer Science (Bachelor’s Degree)
Until 2014
SRKR Engineering College

Languages

English: Advanced