Subhalingam D

Data Scientist at KnowDis

View Résumé

About Me

Subhalingam is interested in the broad areas of Natural Language Processing (NLP), Information Retrieval (IR) and Deep Learning. He currently works as a Data Scientist at KnowDis Data Science and holds a B.Tech. degree in Mathematics and Computing from the Indian Institute of Technology, Delhi (IIT Delhi). He has worked on building neural Q&A, machine translation and recommender systems across a variety of domains. He is specifically interested in applying NLP techniques to Indian languages. Apart from coding, one can also find him listening to music, watching football, teaching other people, having some plates of biryani or sleeping.

Education

Indian Institute of Technology, Delhi

B.Tech. in Mathematics and Computing

CGPA: 8.196

Chennai Public School, Chennai

CBSE Std. XII

Marks: 96.4%

Chennai Public School, Chennai

CBSE Std. X

CGPA: 10

Experience

KnowDis Data Science LLP, Delhi


Data Scientist

Product Category Search Engine (for IndiaMART)
  • Observed recall@2 of 94% (6% higher than recall@1), which motivated building a system to rescore the top-k categories and improve accuracy
  • Built a reranker that encodes the query & retrieved categories independently, aligns each query token with the most relevant category token and aggregates the similarity scores across the query; the category embeddings are pre-computed offline and cached in memory
  • Revised confidence classification rules, resulting in 82% (+6%) coverage for high-confidence class while maintaining accuracy at 95.5%
  • Attained a 1-2% gain in overall accuracy; currently working on running the reranker's encoding step in parallel with the retriever
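
The token-level scoring described for the reranker above resembles late-interaction ("MaxSim") scoring. The following is a minimal sketch of that idea, not the production system: it assumes cosine similarity as the similarity measure and uses toy 2-dimensional token embeddings, and all names (`cosine`, `maxsim_score`, `rerank`, the cache contents) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def maxsim_score(query_tokens, category_tokens):
    """Match each query token to its most similar category token, then sum over the query."""
    return sum(max(cosine(q, c) for c in category_tokens) for q in query_tokens)

def rerank(query_tokens, cached_categories):
    """Score every retrieved category against the query and sort by descending score.

    cached_categories maps a category name to its pre-computed token embeddings.
    """
    scored = [(name, maxsim_score(query_tokens, toks))
              for name, toks in cached_categories.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (assumed values, for illustration only)
query = [[1.0, 0.0], [0.0, 1.0]]
cache = {
    "steel pipes": [[1.0, 0.0], [0.0, 1.0]],
    "cotton shirts": [[-1.0, 0.0], [0.0, -1.0]],
}
ranking = rerank(query, cache)
```

Because the category embeddings are independent of the query, they can be computed once and kept in memory; only the query has to be encoded at request time.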

Contextual Query Understanding (for IndiaMART)
  • Developed a two-stage system to identify all the relevant attributes mentioned in a query and extract their corresponding values
  • Trained BART and RoBERTa models using processed product names and specifications data for attribute identification and labelling
  • Formulated a negative sampling strategy and made input layer modifications to tackle incomplete tagging in the data during training
  • Deployed the system using FastAPI and presented a demo to the client; planned to integrate with search system for refining results

English-to-Hindi Translator with Style Restriction
  • Built an mBART-based translation baseline for converting English texts to Hindi in a specified style using in-house parallel corpora
  • Obtained the English translations for scraped monolingual Hindi data using Google Translate API to augment the training data
  • Reviewing existing works on controlling styles in text generation, specifically for low-resource settings, to create improved systems

Other contributions:
  • Explored non-autoregressive generation methods to convert Roman Hindi words in search queries to English with low latency
  • Experimented with lexical string matching using Elasticsearch to handle model numbers in a search query

KnowDis Data Science LLP, Delhi


Data Science Intern

  • Devised an NLP scheme to predict the most relevant product category (from 113k possible labels) from user queries/product listings
  • Trained a transformer-based classifier on automatically labelled data and added heuristics to improve knowledge of category labels
  • Incorporated causal attention mask, which improved results; fine-tuned T5 model for oversampling under-represented categories
  • Achieved accuracy (~88%) similar to that of the previous seq2seq model while significantly reducing average response time (3x faster) and completely eliminating timeouts
  • The model was integrated with IndiaMART's search system and was deployed in production

Data Group, Indian Institute of Technology, Delhi


Undergraduate Researcher • Supervised by Prof. Srikanta Bedathur & Prof. Maya Ramanath • In collaboration with IBM AI Horizons Network

  • Prepared a dataset consisting of How-to troubleshooting FAQs by scraping WikiHow pages from Computers and Electronics category
  • Constructed BERT-based baselines to predict changes in properties of the entities involved at each step of the process
  • Surveyed the literature to build a next-step recommender from a given sequence of performed actions and developed LSTM baselines

Samsung R&D Institute, Delhi


Software Engineering Intern (S/W Intelligence Team)

  • Developed sound source direction estimation module using time delay of arrival of signals between pairs of microphones in an array
  • Added modules for tracking active sound sources and extracting individual signals for downstream object identification pipeline
  • Integrated stationary noise estimation module for ambient noise removal and reduced maximum direction of arrival error to 7°
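
Direction estimation from the time delay between a pair of microphones can be sketched as follows. This is an illustrative example under stated assumptions, not the module built during the internship: it assumes a far-field source, a single microphone pair, a speed of sound of 343 m/s, and a plain time-domain cross-correlation to find the delay. The function names and the signals are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature

def estimate_delay(sig_a, sig_b, max_lag):
    """Return the lag in samples (positive if sig_b lags sig_a) that maximizes cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    n = len(sig_a)
    for lag in range(-max_lag, max_lag + 1):
        score = sum(sig_a[i] * sig_b[i + lag] for i in range(n) if 0 <= i + lag < n)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def direction_of_arrival(delay_samples, sample_rate, mic_spacing):
    """Far-field angle in degrees, measured from the broadside of the microphone pair."""
    ratio = SPEED_OF_SOUND * delay_samples / (sample_rate * mic_spacing)
    ratio = max(-1.0, min(1.0, ratio))  # clamp against rounding and noise
    return math.degrees(math.asin(ratio))

# Toy signals: the same impulse reaches microphone B three samples after microphone A
sig_a = [0.0] * 32
sig_b = [0.0] * 32
sig_a[10] = 1.0
sig_b[13] = 1.0

delay = estimate_delay(sig_a, sig_b, max_lag=5)
angle = direction_of_arrival(delay, sample_rate=16000, mic_spacing=0.1)
```

With more than two microphones, the per-pair estimates would need to be combined; that step is outside this sketch.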

Received Pre-Placement Offer for impeccable performance during the internship

MateRate Education Pvt Ltd, Delhi


Machine Learning Researcher & Developer

  • Developed Item Response Theory-based models to estimate and analyze the ability of 5000+ students & difficulty of 200+ questions

Backend Web Developer and AWS Associate

  • Designed database schema and built Web APIs using Django REST framework to display students’ performance reports to parents
  • Deployed Django backend using Elastic Beanstalk with MySQL on RDS and React frontend to S3 with CloudFront CDN integration
  • Set up an Auto Scaling group and attached a Load Balancer for horizontal scaling; the portal went live with the results of 5000+ students

Received Letter Of Recommendation from CEO for exemplary work accomplishments

Publications

Tracking entities in technical procedures -- a new dataset and baselines.

Saransh Goyal, Pratyush Pandey, Garima Gaur, Subhalingam D, Srikanta Bedathur, Maya Ramanath. CoRR, 2021.

PDF Code

Activities

Teaching Assistant

Aug '21 - Dec '21

COL764: Information Retrieval & Web Search
(Graduate-level course taught by Prof. Srikanta Bedathur at IIT Delhi)

General Secretary

Aug '21 - Jul '22

Mathematics Society, IIT Delhi

Overall Coordinator

Jul '20 - Jul '21

Mathematics Society, IIT Delhi

Web Development Executive

Sep '19 - Jul '20

Student Incubation Cell, IIT Delhi

Language Mentor

Aug '19 - Dec '19

Board for Student Welfare (BSW), IIT Delhi

Regularly assisted newcomers in improving their English communication skills

Executive

Jul '19 - Jul '20

Mathematics Society, IIT Delhi

Volunteer

National Service Scheme (NSS), IIT Delhi

Over 120 hours of community work primarily in Teaching projects

Technical Executive

Aug '19 - Oct '19

Rendezvous, IIT Delhi

Part of the Web Frontend Development team

Projects

Identification of Hate Spreaders on Social Media (Bachelor's Thesis)

Prof. Niladri Chatterjee

We propose a novel model that uses pre-trained word embeddings for encoding the words and incorporates the sentiment scores as weights to mark the importance of the words. It then computes a weighted sum to get the tweet representation and aggregates these to obtain the user representation. The user representation is finally fed to an ML classifier. Our model achieves an accuracy of 76% on the test set and outperforms the best model in the competition.
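
The representation pipeline described above can be sketched roughly as follows. This is a simplified illustration, not the thesis implementation: it assumes the absolute sentiment score of a word is used as its weight and that tweet vectors are aggregated by averaging (the description does not specify either choice). The toy embeddings, sentiment lexicon and function names are hypothetical.

```python
def tweet_vector(tokens, embeddings, sentiment):
    """Sentiment-weighted sum of word vectors (assumed weight: |sentiment score|)."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    for tok in tokens:
        if tok in embeddings:
            weight = abs(sentiment.get(tok, 0.0))  # neutral or unknown words get weight 0
            for i, x in enumerate(embeddings[tok]):
                vec[i] += weight * x
    return vec

def user_vector(tweets, embeddings, sentiment):
    """Aggregate a user's tweet vectors (assumed aggregation: element-wise mean)."""
    vectors = [tweet_vector(t, embeddings, sentiment) for t in tweets]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# Toy data (assumed values, for illustration only)
embeddings = {"hate": [1.0, 0.0], "love": [0.0, 1.0], "the": [1.0, 1.0]}
sentiment = {"hate": -0.8, "love": 0.6}
tweets = [["the", "hate"], ["love"]]

user_repr = user_vector(tweets, embeddings, sentiment)
```

The resulting `user_repr` is what would be passed to a downstream classifier.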

Ongoing Project

chaii - Hindi and Tamil Question Answering

Prof. Mausam

Fine-tuned XLM-RoBERTa for multilingual Q/A using chaii-1 dataset augmented with MLQA, XQuAD & SQuAD and attained test Jaccard score of 68.72%.

View Project

Context-Sensitive Word Sense Disambiguation

Prof. Mausam

Compared non-contextual and contextual embeddings (GloVe+BiLSTM vs BERT) using WiC dataset for WSD task.

View Project

Tweet Sentiment Classifier

Prof. Mausam

Processed tweets with tweet normalization, internet slang dictionary, stemming, etc.; vectorized with TF-IDF; fed into LR.

View Project

Rule-based Written-to-Spoken Text Converter

Prof. Mausam

Built a regex-based system that accounts for chunks with abbreviations, dates, numerical quantities and inflections. Obtained test F1-score of 97.94%.

View Project

Bankruptcy Prediction

Prof. Niladri Chatterjee

Reviewed state-of-the-art bankruptcy prediction models and observed poor recall. Hypothesized class imbalance & missing values to be the reasons. Trained an ensemble model with Mean Imputation & SMOTE on Polish companies dataset and gained 10% improvement in recall.

View Project

Adaptive Network-based Fuzzy Inference System for Diabetes Prediction

Prof. Niladri Chatterjee

Trained a Takagi–Sugeno type neuro-fuzzy model in TensorFlow for diabetes prediction and obtained accuracy of 81.3%.

View Project

Document Reranking using Pseudo-Relevance Feedback

Prof. Srikanta Bedathur

Used probabilistic query expansion and relevance model based language modeling with unigram/bigram setting & Dirichlet smoothing to rerank retrieved documents and improve the MRR and nDCG scores of the system.
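
As an illustration of the Dirichlet smoothing component only, here is a minimal unigram query-likelihood sketch. It does not cover query expansion, relevance models or the bigram setting, and it is not the project's code: the function name, the toy documents and the value of `mu` are assumptions chosen for the example.

```python
import math
from collections import Counter

def dirichlet_log_likelihood(query, doc, collection_tf, collection_len, mu=2000.0):
    """log P(query | doc) with p(w|d) = (tf(w,d) + mu * p(w|C)) / (|d| + mu)."""
    doc_tf = Counter(doc)
    score = 0.0
    for term in query:
        p_collection = collection_tf.get(term, 0) / collection_len
        p = (doc_tf[term] + mu * p_collection) / (len(doc) + mu)
        if p == 0.0:
            return float("-inf")  # term never occurs in the collection
        score += math.log(p)
    return score

# Toy collection (assumed data, for illustration only)
docs = {
    "d1": ["cheap", "flights", "delhi"],
    "d2": ["delhi", "weather", "today", "delhi"],
}
collection_tf = Counter(term for doc in docs.values() for term in doc)
collection_len = sum(collection_tf.values())

query = ["delhi", "weather"]
ranked = sorted(
    docs,
    key=lambda d: dirichlet_log_likelihood(query, docs[d], collection_tf, collection_len, mu=10.0),
    reverse=True,
)
```

A larger `mu` pulls the document model closer to the collection model, which matters most for short documents.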

View Project

Vector Space Model for News Articles Retrieval

Prof. Srikanta Bedathur

Implemented an end-to-end retrieval system indexed with TF-IDF weights & cosine similarity-based ranking. Added prefix searching and named entity based searching (using StanfordNER) to narrow down the retrieval results. Compressed the index file by encoding differences between document IDs & reduced its size by half (topped class leaderboard for index size).
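
Encoding the differences (gaps) between sorted document IDs can be sketched as below. The description does not say which integer code was used, so this example assumes variable-byte encoding, a common choice for postings lists; the function names are hypothetical.

```python
def vbyte_encode(number):
    """Variable-byte encode a non-negative integer, 7 payload bits per byte."""
    chunks = [number & 0x7F]
    number >>= 7
    while number:
        chunks.append(number & 0x7F)
        number >>= 7
    chunks.reverse()
    chunks[-1] |= 0x80  # a set high bit marks the final byte of a number
    return bytes(chunks)

def encode_postings(doc_ids):
    """Encode a sorted list of document IDs as variable-byte gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        out += vbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)

def decode_postings(data):
    """Invert encode_postings: rebuild the document IDs from the gap stream."""
    doc_ids, current, prev = [], 0, 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:  # last byte of this gap
            prev += current
            doc_ids.append(prev)
            current = 0
    return doc_ids

postings = [824, 829, 215406]
encoded = encode_postings(postings)
decoded = decode_postings(encoded)
```

Small gaps fit in a single byte, so frequent terms with dense postings lists benefit the most from this scheme.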

View Project

Web Designing & Development for SAC, IIT Delhi

Revamped the website using CSS & JavaScript for better user experience and easy accessibility & retrieval of information.

Visit Website

Triangulation Topology Analysis using Graph Theory

Prof. Subodh Kumar

Generated generic Graph data structure to store triangles, points & edges for given triangulation topology of 3D shapes. Implemented traversal algorithms to get neighbours, boundary edges, count of connected components & closest components.

View Project

Priority-based Job Scheduler

Prof. Subodh Kumar

Implemented Trie, Red-Black Tree & Max-Heap to execute jobs from users for projects based on priorities & resources. Added features for fetching job status & top budget consuming users, flushing starving jobs & updating project priorities.

View Project

Symbolic Differentiation

Prof. Subhashis Banerjee

Generated a Binary Tree by parsing fully parenthesised infix expression and computed its derivative by traversal. The parser was made to support a variety of functions like algebraic, trigonometric, exponential & composite functions.

View Project
