Chapter 3 Data transformation
3.1 Fields cleaning
- Some of the fields had html elements and we cleaned them using regular expressions
- Some numeric fields were malformed with text. We removed it with regular expressions
- For some years, wage from and wage to were merged in one column with “-” separator. We splitted it in two columns
- Each year had an inconsistent column name. We selected only the columns that were valuable for our graphs and renamed them consistently
- 2020 was split in four quarters, so we merged it to only one
- The employer names were not standard, so we did some normalization by removing accents, leading and trailing whitespaces etc.
- We extracted the first two digits of the SOC codes that describe the job category
- For the NAICS codes we extracted the first two digits that represent the industry sector
- In general, for all codes we translated to the corresponding definition
3.2 Units conversion
- The employees wage was of varying units:hourly, weekly, bi-weekly, monthly, yearly. For this reason we created a yearly and an hourly field. For the conversion of hourly to yearly payment we found that at average a person works for 1800 hours per year (source)[https://stats.oecd.org/Index.aspx?DataSetCode=ANHRS]. For monthly to hourly conversion we found that someone works for 173 hours per month on average. Also, we assumed that an average US employee works 5 days per week for 8 hours per day