A foundry producing automotive parts would like to know the casting temperature, silicon content and pressure relevant to each part for a specific period of production.
This data has been collected in the following dataframes:
production_data_df: contains the unique_part_identifer, cycle_start_timestamp and PART_TYPE
pressure_data_df: contains the pressure profile recorded by the pressure_sensor
temperature_data_df: contains the casting Temperature
silicon_data_df: contains the furnace_silicon_content
Objective:
The objective is to build a unified view function that will extract all relevant information for each part produced.
You will have to build a dataframe, a ground truth, where each row represents a single part and each column holds the relevant data extracted and aligned from each process.
About the dataset
production_data_df: Parts are produced in batches of the same product type. A part is produced every 30 minutes and has a unique part identifier. Batches of the same product type are identified by the product type name, which is logged on the control system when the type changes and applies from the next part onwards.
pressure_data_df: During casting, the pressure sensor logs the exerted pressure every 10 seconds (the pressure increases during this time) until the casting is completed. The result is a pressure cycle that lasts ~30 minutes and resets when casting is completed. You will need to extract the maximum pressure reached in each cycle and the time elapsed (in minutes) from when the production cycle starts to the moment this peak is reached.
temperature_data_df: The casting temperature is recorded at the beginning of the casting cycle (in the first 10 minutes)
silicon_data_df: Bulk quantities of metal are melted in the furnace, and as a result, the chemistry remains relatively similar for a few hours of production, and is therefore recorded less frequently. The silicon in particular is recorded every ~4 hours, and should apply for the next 4 hours of production.
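To make this alignment rule concrete, here is a toy sketch (hypothetical values; dplyr >= 1.1 for join_by()/closest()): each part picks up the most recent silicon reading logged at or before its cycle start, so a reading taken at 08:00 applies to all parts until the next reading arrives around 12:00.

library(dplyr)
library(lubridate)

# Toy data only: parts every 30 minutes, silicon readings every ~4 hours
parts <- tibble(
  cycle_start_timestamp = ymd_hms("2024-01-15 08:00:00") + minutes(seq(0, 270, by = 30))
)
silicon <- tibble(
  timestamp = ymd_hms(c("2024-01-15 08:00:00", "2024-01-15 12:00:00")),
  furnace_silicon_content = c(2.1, 2.3)  # hypothetical values
)

# Each part gets the most recent silicon reading at or before its cycle start
parts %>%
  left_join(silicon, by = join_by(closest(cycle_start_timestamp >= timestamp)))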
1. Load data
library(arrow)
library(tidyverse)
library(dplyr)
library(tidyr)
library(here)
library(janitor)
library(reticulate)
library(knitr)
library(lubridate)

## Casting temperature data
casting_df <- read_parquet(here("data", "casting_temperature_data.parquet")) %>%
  clean_names() %>%
  select(everything(), timestamp = index_level_0) %>%
  mutate(timestamp = ymd_hms(timestamp))

## Furnace silicon data
furnace_df <- read_parquet(here("data", "furnace_silicon_data.parquet")) %>%
  clean_names() %>%
  select(everything(), timestamp = index_level_0) %>%
  mutate(timestamp = ymd_hms(timestamp))

## Pressure data
pressure_df <- read_parquet(here("data", "pressure_data.parquet")) %>%
  clean_names() %>%
  select(everything(), timestamp = index_level_0) %>%
  mutate(timestamp = ymd_hms(timestamp))

## Production logging data
production_df <- read_parquet(here("data", "production_logging_data.parquet")) %>%
  clean_names() %>%
  select(everything(), timestamp = index_level_0) %>%
  mutate(timestamp = ymd_hms(timestamp))

cat(
  "We have 4 datasets and below are their dimensions: \n",
  "- Casting temperature data", nrow(casting_df), "rows and", ncol(casting_df), "variables, \n",
  "- Furnace silicon data", nrow(furnace_df), "rows and", ncol(furnace_df), "variables, \n",
  "- Pressure data", nrow(pressure_df), "rows and", ncol(pressure_df), "variables and,\n",
  "- Production logging data", nrow(production_df), "rows and", ncol(production_df), "variables. \n"
)
We have 4 datasets and below are their dimensions:
- Casting temperature data 49 rows and 2 variables,
- Furnace silicon data 6 rows and 2 variables,
- Pressure data 8641 rows and 2 variables and,
- Production logging data 51 rows and 4 variables.
2. Implementation
2.1. Block 1: Preprocess “production data”
We simply fill in the part_type from the type logged on the preceding setup line and remove the two setup lines themselves (a sketch is given after the tip below).
Tip
We just have to make sure that the final dataframe has 49 rows and 4 columns with no NA values, as it must be a dense dataframe.
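A minimal sketch of this block, assuming the loaded production_df keeps its four columns (including timestamp, unique_part_identifer and part_type after clean_names()) and that the two setup lines are the rows without a part identifier; that last point is an assumption, not something shown above.

# Sketch only: the exact way setup lines are flagged may differ in the real data
production_clean <- production_df %>%
  arrange(timestamp) %>%
  # the part type is logged only when it changes; carry it forward to later parts
  fill(part_type, .direction = "down") %>%
  # drop the two setup lines themselves (assumed to have no part identifier)
  filter(!is.na(unique_part_identifer))

# Sanity check from the tip above: 49 parts, 4 dense columns
stopifnot(nrow(production_clean) == 49, ncol(production_clean) == 4, !anyNA(production_clean))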
2.2. Block 2: Preprocess “pressure data”
For each ~30-minute cycle we extract the maximum pressure and the time (in minutes) from the cycle start to that peak. The resulting per-cycle summary looks like this:
# A tibble: 49 × 3
cycle_start_time max_pressure time_to_max
<dttm> <dbl> <drtn>
1 2024-01-15 02:00:00 51.5 17.66667 mins
2 2024-01-15 02:30:00 52.0 20.16667 mins
3 2024-01-15 03:00:00 50.5 21.83333 mins
4 2024-01-15 03:30:00 50.3 17.50000 mins
5 2024-01-15 04:00:00 50.8 16.66667 mins
6 2024-01-15 04:30:00 51.8 20.66667 mins
7 2024-01-15 05:00:00 51.3 20.33333 mins
8 2024-01-15 05:30:00 50.8 16.66667 mins
9 2024-01-15 06:00:00 53.3 20.00000 mins
10 2024-01-15 06:30:00 50.3 20.33333 mins
# ℹ 39 more rows
Using Python via reticulate, we compute the cycle number, the per-cycle maximum pressure and the time to reach it:
import pandas as pd

pressure_df = r.pressure_df
pressure_df['timestamp'] = pd.to_datetime(pressure_df['timestamp'])

# Assign each reading to a 30-minute casting cycle
pressure_df['cycle_number'] = (
    (pressure_df['timestamp'] - pressure_df['timestamp'].min()).dt.total_seconds() // (30 * 60) + 1
)
# print(pressure_df['cycle_number'].unique())

# Cycle start time
pressure_df['cycle_start_time'] = pressure_df.groupby('cycle_number')['timestamp'].transform('min')

# Maximum pressure within each cycle
pressure_df['max_pressure'] = pressure_df.groupby('cycle_number')['pressure_sensor'].transform('max')

# Isolate the rows where the pressure equals the cycle maximum
# (.copy() avoids a SettingWithCopyWarning when adding columns below)
df_max_pressure = pressure_df[pressure_df['pressure_sensor'] == pressure_df['max_pressure']].copy()

# Time in minutes from cycle_start_time to the timestamp where the peak occurs
df_max_pressure['time_to_max'] = (
    (df_max_pressure['timestamp'] - df_max_pressure['cycle_start_time']).dt.total_seconds() / 60
)

# Final per-cycle summary
pressure_clean = df_max_pressure[['cycle_start_time', 'max_pressure', 'time_to_max']]
print(pressure_clean.head(10))
2.3. Block 3: Join the “temperature data”
Here we simply perform a join and keep the relevant columns.
Tip
The variables we must join on, unlike in the previous join, are not exactly equal. For example, the timestamp from the production data is 02:00:00 while the timestamp from the casting data is 02:00:13. The pattern repeats itself, with the casting timestamp being equal to or slightly later than the production timestamp. One possible approach is sketched below.
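One way to handle this mismatch, assuming dplyr >= 1.1 and the Block 1 output production_clean sketched earlier, is a rolling join that picks, for each part, the first casting reading at or after the cycle start; the renamed casting_timestamp column is introduced here only for clarity.

# Sketch only: pick the closest casting reading at or after each cycle start
temperature_joined <- production_clean %>%
  left_join(
    casting_df %>% rename(casting_timestamp = timestamp),
    by = join_by(closest(timestamp <= casting_timestamp))
  )

An alternative would be to floor the casting timestamp to the nearest 30 minutes with lubridate::floor_date(timestamp, "30 minutes") and then join on exact equality.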
Here we review the whole process and fix any remaining errors.
Here are our final, unified data, to be wrapped into a function (a sketch is given below). We have 5 NA values in the casting_temperature variable, corresponding to the original NA values in casting_df.
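A minimal sketch of what the unified view function could look like, assuming the cleaned pieces built above (production_clean from Block 1, pressure_clean from the Python block brought back into R via reticulate's py$, plus the casting_df and furnace_df loaded earlier) and dplyr >= 1.1; the function name and the renamed timestamp columns are hypothetical.

# Hypothetical assembly of the unified view; any name not shown earlier is an assumption
build_unified_view <- function(production_clean, pressure_clean, casting_df, furnace_df) {
  production_clean %>%
    # Pressure summary: exact match on the 30-minute cycle start
    left_join(pressure_clean, by = join_by(timestamp == cycle_start_time)) %>%
    # Casting temperature: first reading at or shortly after the cycle start
    left_join(
      casting_df %>% rename(casting_timestamp = timestamp),
      by = join_by(closest(timestamp <= casting_timestamp))
    ) %>%
    # Furnace silicon: most recent reading before the cycle start (applies ~4 h forward)
    left_join(
      furnace_df %>% rename(silicon_timestamp = timestamp),
      by = join_by(closest(timestamp >= silicon_timestamp))
    ) %>%
    select(-casting_timestamp, -silicon_timestamp)
}

unified_df <- build_unified_view(
  production_clean,
  py$pressure_clean,  # pandas result from the Python block, converted by reticulate
  casting_df,
  furnace_df
)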