
Pune, India - 2018-2021
- #rest
- #cloud
- #connectors
- #improvement
- #xlsx
- #extra
- #spark
- #linux
- #fromScratch
- #nosql
- #rdbms
- #poc
- #devOps
- #file
- #crm
- #reset
- Designed and developed REST APIs in Java for services such as the connector service (pulls and pushes data from/to any source) and the scanner service (scans sources for data structure), run as Spark cluster jobs for ETL.
- Worked on cloud connectors such as Azure Blob, Amazon S3, and GCP, with a major focus on Azure Blob.
- Improved performance of the XLSX data parser by ~50% by identifying bottlenecks in the Apache POI implementation. Designed, implemented, and maintained new features such as range read and streaming read for the XLSX connector.
- Developed Bash scripts on Linux for automated performance benchmarking of various connectors for Spark on K8s.
- Designed and developed functionality such as upsert for RDBMS via Spark (see the sketch after this list) and MongoDB authentication.
- Researched the feasibility of a connector for Google Drive. Developed a POC to demo the integrated functionality.
- Worked on other technologies such as Docker, Kubernetes, Azure Blob, Amazon S3, SFTP, Spring Boot, CI/CD, Snowflake, SonarQube, and ElasticSearch. Strong experience in Linux, with Ubuntu as the development OS of choice.
- Contributed to maintenance and enhancement of Spark cluster processing for about 15 connectors.
- Worked on file connectors like Delimited, JSON, XLSX, Parquet, XML, SAS7BDAT, etc.
- Hands-on experience working on data parsers (connectors) for various sources: RDBMS (PostgreSQL, MySQL, Hive, SQL Server, Snowflake, etc.); file formats like XLSX, delimited, SAS7BDAT, JSON, and XML; NoSQL like MongoDB; and CRM platforms like Salesforce.
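A minimal sketch of the RDBMS upsert idea via Spark, assuming a PostgreSQL target; the table name `target_table`, its `id`/`value` columns, the JDBC URL, and the input path are hypothetical placeholders, not the actual connector code.

```java
// Hedged sketch: upsert a Spark Dataset into PostgreSQL via JDBC, one batched
// statement per partition. Table, columns, URL, and paths are assumed values.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UpsertSketch {

    public static void upsert(Dataset<Row> df, String jdbcUrl, String user, String password) {
        // One JDBC connection and one batched upsert statement per Spark partition.
        df.foreachPartition((ForeachPartitionFunction<Row>) rows -> {
            String sql = "INSERT INTO target_table (id, value) VALUES (?, ?) "
                       + "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value";
            try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                while (rows.hasNext()) {
                    Row row = rows.next();
                    ps.setLong(1, row.getLong(row.fieldIndex("id")));
                    ps.setString(2, row.getString(row.fieldIndex("value")));
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        });
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("upsert-sketch").getOrCreate();
        Dataset<Row> df = spark.read().parquet("/data/incoming"); // hypothetical input path
        upsert(df, "jdbc:postgresql://localhost:5432/demo", "demo", "demo");
        spark.stop();
    }
}
```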
XLSX connector performance improvement
- Apache POI was the best fit for working with XLSX, but it was the slowest implementation for files larger than 50 MB. I was heavily involved in identifying the bottlenecks at worksheet object creation, looking for alternatives, and building POCs with different libraries. Gained firsthand experience using jvisualvm for analysing JVMs.
- Integrated a streaming-based approach for reading huge XLSX files and ensured all custom features worked with the new approach (a sketch of the idea follows below).
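A minimal sketch of the streaming-read idea using Apache POI's SAX-based event API, which avoids materialising the full worksheet object; the file path and the row-printing handler are illustrative stand-ins, not the connector's real handler.

```java
// Hedged sketch: stream an XLSX file sheet-by-sheet with POI's event API.
import java.io.InputStream;
import java.util.Iterator;

import javax.xml.parsers.SAXParserFactory;

import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class StreamingXlsxRead {

    // Receives rows/cells one at a time instead of a fully built worksheet object.
    static class PrintingHandler implements SheetContentsHandler {
        public void startRow(int rowNum) { System.out.print("row " + rowNum + ":"); }
        public void endRow(int rowNum) { System.out.println(); }
        public void cell(String cellRef, String formattedValue, XSSFComment comment) {
            System.out.print(" " + formattedValue);
        }
        public void headerFooter(String text, boolean isHeader, String tagName) { /* unused */ }
    }

    public static void main(String[] args) throws Exception {
        try (OPCPackage pkg = OPCPackage.open("large-file.xlsx")) { // hypothetical path
            XSSFReader reader = new XSSFReader(pkg);
            XSSFSheetXMLHandler handler = new XSSFSheetXMLHandler(
                    reader.getStylesTable(),
                    new ReadOnlySharedStringsTable(pkg),
                    new PrintingHandler(),
                    false); // false = emit cached formula results, not formula strings

            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // the handler matches elements by namespace
            XMLReader parser = factory.newSAXParser().getXMLReader();
            parser.setContentHandler(handler);

            // Stream each sheet's XML; memory use stays flat regardless of file size.
            Iterator<InputStream> sheets = reader.getSheetsData();
            while (sheets.hasNext()) {
                try (InputStream sheet = sheets.next()) {
                    parser.parse(new InputSource(sheet));
                }
            }
        }
    }
}
```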
JSON connector performance improvement
- Involved in refactoring the JSON reading implementation three times; the last refactor was built on top of Spark's own implementation. Spark's implementation caused many issues with multi-line JSON; ours was lighter, used Gson streaming, worked on the Spark cluster, and read JSON as bytes line by line to create a Dataset (see the sketch below).
- This approach worked best, as it handled multi-line JSON, JSON Lines, and beautified JSON perfectly.
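A minimal sketch of the Gson streaming idea: a lenient `JsonReader` pulls consecutive top-level JSON values no matter how they are split across lines, and the normalised one-record-per-string output is handed to Spark. Reading on the driver and the input path are simplifications for illustration, not the connector's actual distributed read.

```java
// Hedged sketch: Gson streaming over JSON Lines / multi-line / pretty-printed JSON,
// then schema inference by Spark from one-record-per-string data.
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import com.google.gson.Gson;
import com.google.gson.JsonObject;
import com.google.gson.stream.JsonReader;
import com.google.gson.stream.JsonToken;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GsonStreamingJson {

    public static void main(String[] args) throws Exception {
        Gson gson = new Gson();
        List<String> records = new ArrayList<>();

        // Lenient mode lets JsonReader consume several top-level values in a row,
        // regardless of how each record is split across lines.
        try (JsonReader reader = new JsonReader(new FileReader("input.json"))) { // hypothetical path
            reader.setLenient(true);
            while (reader.peek() != JsonToken.END_DOCUMENT) {
                JsonObject obj = gson.fromJson(reader, JsonObject.class);
                records.add(obj.toString()); // one compact JSON string per record
            }
        }

        SparkSession spark = SparkSession.builder().appName("gson-json-sketch").getOrCreate();
        Dataset<String> jsonLines = spark.createDataset(records, Encoders.STRING());
        Dataset<Row> df = spark.read().json(jsonLines); // schema inferred from the records
        df.show();
        spark.stop();
    }
}
```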
Git
- One of the maintainers of our repo, solving all Git issues. We used a linear history model, which was easier to comprehend.
Custom features on XLSX/CSV
- Implemented all custom features on XLSX and delimited files. Language support and encoding caused some issues when Spark and non-Spark libraries were used, but I ensured the implementations were as consistent as possible (see the encoding sketch below).
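A minimal sketch of keeping the encoding consistent between a Spark read and a plain Java read of the same delimited file; the path, delimiter, and charset are assumed values for illustration.

```java
// Hedged sketch: pass the same charset to both readers instead of relying on defaults.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ConsistentEncodingRead {

    private static final Charset CHARSET = Charset.forName("ISO-8859-1"); // assumed source encoding

    public static void main(String[] args) throws Exception {
        String path = "data.csv"; // hypothetical file

        // Spark path: set the charset explicitly on the CSV reader.
        SparkSession spark = SparkSession.builder().appName("encoding-sketch").getOrCreate();
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("delimiter", ";")
                .option("encoding", CHARSET.name())
                .csv(path);
        df.show(5);
        spark.stop();

        // Non-Spark path: decode with the same charset so both readers agree on the text.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), CHARSET))) {
            System.out.println(reader.readLine()); // header row
        }
    }
}
```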
Refactor for better design
- Designed and implemented better single-responsibility boundaries for the existing class structures while adding advanced JDBC connection URLs (HA for SQL Server) to all SQL connectors. This allowed cleaner construction of customized connection URL strings when creating specific SQL connectors (a sketch follows below).
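A minimal sketch of the per-connector connection URL idea: a small builder interface with a SQL Server implementation that appends the `multiSubnetFailover` property used for HA setups. The class names and extra properties are illustrative, not the real connector classes.

```java
// Hedged sketch: each SQL connector supplies its own URL builder.
import java.util.LinkedHashMap;
import java.util.Map;

interface ConnectionUrlBuilder {
    String build(String host, int port, String database, Map<String, String> extraProps);
}

class SqlServerUrlBuilder implements ConnectionUrlBuilder {
    @Override
    public String build(String host, int port, String database, Map<String, String> extraProps) {
        StringBuilder url = new StringBuilder("jdbc:sqlserver://")
                .append(host).append(":").append(port)
                .append(";databaseName=").append(database);
        // HA-specific property for multi-subnet availability group listeners.
        url.append(";multiSubnetFailover=true");
        extraProps.forEach((k, v) -> url.append(";").append(k).append("=").append(v));
        return url.toString();
    }
}

public class UrlBuilderDemo {
    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("encrypt", "true"); // assumed extra property
        String url = new SqlServerUrlBuilder().build("ag-listener.example.com", 1433, "sales", props);
        System.out.println(url);
        // jdbc:sqlserver://ag-listener.example.com:1433;databaseName=sales;multiSubnetFailover=true;encrypt=true
    }
}
```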
Bash Scripts
- Bash scripting is a personal favourite; I started writing Bash scripts for almost all mundane jobs. Honourable mention: while doing performance benchmarking, I developed a script that spawns Kubernetes pods of different configs for a specified connector type and sleeps in the background. It periodically checks the pods' completion status and, once done, greps the logs and prunes the timings to present the result. By end of day, the performance timings were ready for me in a CSV. All this thanks to the support team not tightly controlling the environment.
- Wrote another Bash script for pushing our latest code to support's Git environment for deployment, based only on image tags. The script would "correctly" rebase main, update the deploy file with the new tags, and push with a consistent commit message. It also handled scenarios where there were conflicts on pull/push. With the script, you just need the image tags ready and never have to interact with Git directly.
Tech Stack Used
- Java
- Spring Boot
- Docker
- Kubernetes
- MySQL
- PostgreSQL
- Snowflake
- Hive
- MongoDB
- Azure Blob
- S3
- GCP
- Spark
- jvisualvm
- Bash
- Git
- Sublime Text 3
Icon Credits
- LTI logo created by https://companieslogo.com/ - https://companieslogo.com/larsen-toubro-infotech/logo/
- Xlsx icons created by Creativenoys01 - Flaticon - https://www.flaticon.com/free-icons/xlsx
- Json icons created by Smashicons - Flaticon - https://www.flaticon.com/free-icons/json
- Git icons created by pictogramer - Flaticon - https://www.flaticon.com/free-icons/git
- Steam icons created by Freepik - Flaticon - https://www.flaticon.com/free-icons/steam
- Linux icons created by Abu Shafiyya - Flaticon - https://www.flaticon.com/free-icons/linux