Wafer Fault Detection - Wafer circleci with python

Overview

Wafer Fault Detection

Problem Statement:

Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor,
such as a crystalline silicon (c-Si), used for fabricationof integrated circuits and in photovoltaics,
to manufacture solar cells.

The inputs of various sensors for different wafers have been provided.
The goal is to build a machine learning model which predicts whether a wafer needs to be replaced or not
(i.e whether it is working or not) nased on the inputs from various sensors.
There are two classes: +1 and -1.
+1: Means that the wafer is in a working condition and it doesn't need to be replaced.
-1: Means that the wafer is faulty and it needa to be replaced.

Data Description

The client will send data in multiple sets of files in batches at a given location.
Data will contain Wafer names and 590 columns of different sensor values for each wafer.
The last column will have the "Good/Bad" value for each wafer.

Apart from training files, we laso require a "schema" file from the client, which contain all the
relevant information about the training files such as:

Name of the files, Length of Date value in FileName, Length of Time value in FileName, NUmber of Columnns, 
Name of Columns, and their dataype.

Data Validation

In This step, we perform different sets of validation on the given set of training files.

Name Validation: We validate the name of the files based on the given name in the schema file. We have 
created a regex patterg as per the name given in the schema fileto use for validation. After validating 
the pattern in the name, we check for the length of the date in the file name as well as the length of time 
in the file name. If all the values are as per requirements, we move such files to "Good_Data_Folder" else
we move such files to "Bad_Data_Folder."

Number of Columns: We validate the number of columns present in the files, and if it doesn't match with the
value given in the schema file, then the file id moves to "Bad_Data_Folder."

Name of Columns: The name of the columns is validated and should be the same as given in the schema file. 
If not, then the file is moved to "Bad_Data_Folder".

The datatype of columns: The datatype of columns is given in the schema file. This is validated when we insert
the files into Database. If the datatype is wrong, then the file is moved to "Bad_Data_Folder."

Null values in columns: If any of the columns in a file have all the values as NULL or missing, we discard such
a file and move it to "Bad_Data_Folder".

Data Insertion in Database

 Database Creation and Connection: Create a database with the given name passed. If the database is already created,
 open the connection to the database.
 
 Table creation in the database: Table with name - "Good_Data", is created in the database for inserting the files 
 in the "Good_Data_Folder" based on given column names and datatype in the schema file. If the table is already
 present, then the new table is not created and new files are inserted in the already present table as we want 
 training to be done on new as well as old training files.
 
 Insertion of file in the table: All the files in the "Good_Data_Folder" are inserted in the above-created table. If
 any file has invalid data type in any of the columns, the file is not loaded in the table and is moved to 
 "Bad_Data_Folder".

Model Training

 Data Export from Db: The data in a stored database is exported as a CSV file to be used for model training.
 
 Data Preprocessing: 
    Check for null values in the columns. If present, impute the null values using the KNN imputer.
    
    Check if any column has zero standard deviation, remove such columns as they don't give any information during 
    model training.
    
 Clustering: KMeans algorithm is used to create clusters in the preprocessed data. The optimum number of clusters 
 is selected

Create a file "Dockerfile" with below content

FROM python:3.7
COPY . /app
WORKDIR /app
RUN pip install -r requirements.txt
ENTRYPOINT [ "python" ]
CMD [ "main.py" ]

Create a "Procfile" with following content

web: gunicorn main:app

create a file ".circleci\config.yml" with following content

> $BASH_ENV echo 'export IMAGE_NAME=python-circleci-docker' >> $BASH_ENV python3 -m venv venv . venv/bin/activate pip install --upgrade pip pip install -r requirements.txt - save_cache: key: deps1-{{ .Branch }}-{{ checksum "requirements.txt" }} paths: - "venv" - run: command: | . venv/bin/activate python -m pytest -v tests/test_script.py - store_artifacts: path: test-reports/ destination: tr1 - store_test_results: path: test-reports/ - setup_remote_docker: version: 19.03.13 - run: name: Build and push Docker image command: | docker build -t $DOCKERHUB_USER/$IMAGE_NAME:$TAG . docker login -u $DOCKERHUB_USER -p $DOCKER_HUB_PASSWORD_USER docker.io docker push $DOCKERHUB_USER/$IMAGE_NAME:$TAG deploy: executor: heroku/default steps: - checkout - run: name: Storing previous commit command: | git rev-parse HEAD > ./commit.txt - heroku/install - setup_remote_docker: version: 18.06.0-ce - run: name: Pushing to heroku registry command: | heroku container:login #heroku ps:scale web=1 -a $HEROKU_APP_NAME heroku container:push web -a $HEROKU_APP_NAME heroku container:release web -a $HEROKU_APP_NAME workflows: build-test-deploy: jobs: - build-and-test - deploy: requires: - build-and-test filters: branches: only: - main ">
version: 2.1
orbs:
  heroku: circleci/[email protected]
jobs:
  build-and-test:
    executor: heroku/default
    docker:
      - image: circleci/python:3.6.2-stretch-browsers
        auth:
          username: mydockerhub-user
          password: $DOCKERHUB_PASSWORD  # context / project UI env-var reference
    steps:
      - checkout
      - restore_cache:
          key: deps1-{{ .Branch }}-{{ checksum "requirements.txt" }}
      - run:
          name: Install Python deps in a venv
          command: |
            echo 'export TAG=0.1.${CIRCLE_BUILD_NUM}' >> $BASH_ENV
            echo 'export IMAGE_NAME=python-circleci-docker' >> $BASH_ENV
            python3 -m venv venv
            . venv/bin/activate
            pip install --upgrade pip
            pip install -r requirements.txt
      - save_cache:
          key: deps1-{{ .Branch }}-{{ checksum "requirements.txt" }}
          paths:
            - "venv"
      - run:
          command: |
            . venv/bin/activate
            python -m pytest -v tests/test_script.py
      - store_artifacts:
          path: test-reports/
          destination: tr1
      - store_test_results:
          path: test-reports/
      - setup_remote_docker:
          version: 19.03.13
      - run:
          name: Build and push Docker image
          command: |
            docker build -t $DOCKERHUB_USER/$IMAGE_NAME:$TAG .
            docker login -u $DOCKERHUB_USER -p $DOCKER_HUB_PASSWORD_USER docker.io
            docker push $DOCKERHUB_USER/$IMAGE_NAME:$TAG
  deploy:
    executor: heroku/default
    steps:
      - checkout
      - run:
          name: Storing previous commit
          command: |
            git rev-parse HEAD > ./commit.txt
      - heroku/install
      - setup_remote_docker:
          version: 18.06.0-ce
      - run:
          name: Pushing to heroku registry
          command: |
            heroku container:login
            #heroku ps:scale web=1 -a $HEROKU_APP_NAME
            heroku container:push web -a $HEROKU_APP_NAME
            heroku container:release web -a $HEROKU_APP_NAME

workflows:
  build-test-deploy:
    jobs:
      - build-and-test
      - deploy:
          requires:
            - build-and-test
          filters:
            branches:
              only:
                - main

to create requirements.txt

pip freeze>requirements.txt

initialize git repo

git push -u origin main ">
git init
git add .
git commit -m "first commit"
git branch -M main
git remote add origin 
   
    
git push -u origin main

   

create a account at circle ci

Circle CI

setup your project

Setup project

Select project setting in CircleCI and below environment variable

DOCKERHUB_USER
DOCKER_HUB_PASSWORD_USER
HEROKU_API_KEY
HEROKU_APP_NAME
HEROKU_EMAIL_ADDRESS
DOCKER_IMAGE_NAME=wafercircle3270303

to update the modification

git add .
git commit -m "proper message"
git push 
Owner
Avnish Yadav
Avnish Yadav
Pyspark Spotify ETL

This is my first Data Engineering project, it extracts data from the user's recently played tracks using Spotify's API, transforms data and then loads it into Postgresql using SQLAlchemy engine. Data

16 Jun 09, 2022
Candlestick Pattern Recognition with Python and TA-Lib

Candlestick-Pattern-Recognition-with-Python-and-TA-Lib Goal Look at the S&P500 to try and get a better understanding of these candlestick patterns and

Ganesh Jainarain 11 Oct 07, 2022
Flenser is a simple, minimal, automated exploratory data analysis tool.

Flenser Have you ever been handed a dataset you've never seen before? Flenser is a simple, minimal, automated exploratory data analysis tool. It runs

John McCambridge 79 Sep 20, 2022
:truck: Agile Data Preparation Workflows made easy with dask, cudf, dask_cudf and pyspark

To launch a live notebook server to test optimus using binder or Colab, click on one of the following badges: Optimus is the missing framework to prof

Iron 1.3k Dec 30, 2022
Data Analytics: Modeling and Studying data relating to climate change and adoption of electric vehicles

Correlation-Study-Climate-Change-EV-Adoption Data Analytics: Modeling and Studying data relating to climate change and adoption of electric vehicles I

Jonathan Feng 1 Jan 03, 2022
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your in

Blue Collar Bioinformatics 917 Jan 03, 2023
Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which he recommends to buy. We will use this data to build a portfolio

Backtesting the "Cramer Effect" & Recommendations from Cramer Recommendations from Cramer: On the show Mad-Money (CNBC) Jim Cramer picks stocks which

Gábor Vecsei 12 Aug 30, 2022
Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

898 Jan 09, 2023
A Python package for modular causal inference analysis and model evaluations

Causal Inference 360 A Python package for inferring causal effects from observational data. Description Causal inference analysis enables estimating t

International Business Machines 506 Dec 19, 2022
Toolchest provides APIs for scientific and bioinformatic data analysis.

Toolchest Python Client Toolchest provides APIs for scientific and bioinformatic data analysis. It allows you to abstract away the costliness of runni

Toolchest 11 Jun 30, 2022
A pipeline that creates consensus sequences from a Nanopore reads. I

A pipeline that creates consensus sequences from a Nanopore reads. It clusters reads that are similar to each other and creates a consensus that is then identified using BLAST.

Ada Madejska 2 May 15, 2022
The micro-framework to create dataframes from functions.

The micro-framework to create dataframes from functions.

Stitch Fix Technology 762 Jan 07, 2023
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift.

ETL Pipeline for AWS Project Description In this project, ETL pipeline is build on data warehouse hosted on AWS Redshift. The data is loaded from S3 t

Mobeen Ahmed 1 Nov 01, 2021
Python package to transfer data in a fast, reliable, and packetized form.

pySerialTransfer Python package to transfer data in a fast, reliable, and packetized form.

PB2 101 Dec 07, 2022
PyIOmica (pyiomica) is a Python package for omics analyses.

PyIOmica (pyiomica) This repository contains PyIOmica, a Python package that provides bioinformatics utilities for analyzing (dynamic) omics datasets.

G. Mias Lab 13 Jun 29, 2022
A probabilistic programming language in TensorFlow. Deep generative models, variational inference.

Edward is a Python library for probabilistic modeling, inference, and criticism. It is a testbed for fast experimentation and research with probabilis

Blei Lab 4.7k Jan 09, 2023
Jupyter notebooks for the book "The Elements of Statistical Learning".

This repository contains Jupyter notebooks implementing the algorithms found in the book and summary of the textbook.

Madiyar 369 Dec 30, 2022
A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

SymPy 9.9k Dec 31, 2022
A columnar data container that can be compressed.

Unmaintained Package Notice Unfortunately, and due to lack of resources, the Blosc Development Team is unable to maintain this package anymore. During

944 Dec 09, 2022