Chapter 2 Data sources

To conduct our analysis, we used the database of Labor Condition Applications (LCAs) that employers file with the US Department of Labor Employment and Training Administration (ETA). The data is generated by the Office of Foreign Labor Certification and can be found here: https://www.dol.gov/agencies/eta/foreign-labor/performance.

2.1 How is the data collected and what kind of information can be found?

LCAs are forms that employers must fill out. The responses are then extracted and stored in a database.

These forms contain information about the employer and the employee. Some variables of interest are:

  • Employer name
  • Employer location
  • Employer industry (e.g., tech, finance)
  • Job title and job category of the employee

The data is spread across multiple files, one for each year from 2013 to 2020. Each file contains approximately 600,000 rows and more than 100 columns, mostly categorical. However, most of these columns contain too many missing values to be considered in our analysis.
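As an illustration of how such columns can be screened, here is a minimal sketch that computes the share of missing values per column and keeps only the columns below a threshold. The column names, the sample rows, and the 50% cutoff are assumptions made for the example, not the values used in our analysis.

```python
def missing_fraction(rows, columns):
    """Return the share of missing (None or blank) values for each column."""
    total = len(rows)
    counts = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or str(value).strip() == "":
                counts[col] += 1
    return {col: (counts[col] / total if total else 0.0) for col in columns}


def usable_columns(rows, columns, max_missing=0.5):
    """Keep columns whose missing share does not exceed max_missing."""
    fractions = missing_fraction(rows, columns)
    return [col for col in columns if fractions[col] <= max_missing]


# Hypothetical rows standing in for one yearly file.
sample_rows = [
    {"EMPLOYER_NAME": "Acme Inc", "EMPLOYER_STATE": "NY", "AGENT_NAME": ""},
    {"EMPLOYER_NAME": "Globex LLC", "EMPLOYER_STATE": "CA", "AGENT_NAME": None},
    {"EMPLOYER_NAME": "Initech", "EMPLOYER_STATE": "", "AGENT_NAME": "J. Doe"},
]
sample_columns = ["EMPLOYER_NAME", "EMPLOYER_STATE", "AGENT_NAME"]

kept = usable_columns(sample_rows, sample_columns, max_missing=0.5)
print(kept)
```

In this toy input, AGENT_NAME is missing in two of three rows and is therefore dropped, while the other two columns are kept.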

Main issues with the data

One of the main problems with the data is that the files for each year did not have the same structure. Across files, we detected the following problems:

  • Different number of columns per file
  • Different column names
  • Different file formats (in some cases, the data of a full year was split by quarter)
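One way to handle these structural differences is to map every year-specific column name to a common name, fill columns that a file lacks with a missing value, and then stack the yearly (or quarterly) pieces. The sketch below shows the idea on in-memory rows. The alias table, the column names, and the sample records are hypothetical and only stand in for the real files.

```python
# Hypothetical mapping from year-specific column names to one common name.
COLUMN_ALIASES = {
    "LCA_CASE_EMPLOYER_NAME": "EMPLOYER_NAME",
    "EMPLOYER_NAME": "EMPLOYER_NAME",
    "LCA_CASE_EMPLOYER_STATE": "EMPLOYER_STATE",
    "EMPLOYER_STATE": "EMPLOYER_STATE",
    "STATUS": "CASE_STATUS",
    "CASE_STATUS": "CASE_STATUS",
}


def harmonize(rows, year):
    """Rename known columns, drop unknown ones, and fill absent ones with None."""
    common = sorted(set(COLUMN_ALIASES.values()))
    harmonized = []
    for row in rows:
        clean = {col: None for col in common}
        for name, value in row.items():
            target = COLUMN_ALIASES.get(name.strip().upper())
            if target is not None:
                clean[target] = value
        clean["YEAR"] = year
        harmonized.append(clean)
    return harmonized


def combine(files_by_year):
    """Stack all parts (e.g. quarters) of all years into one list of rows."""
    combined = []
    for year, parts in sorted(files_by_year.items()):
        for rows in parts:
            combined.extend(harmonize(rows, year))
    return combined


# Hypothetical input: 2013 is a single file, 2020 is split into two parts.
files_by_year = {
    2013: [[{"LCA_CASE_EMPLOYER_NAME": "Acme Inc", "STATUS": "CERTIFIED"}]],
    2020: [
        [{"EMPLOYER_NAME": "Globex LLC", "CASE_STATUS": "DENIED", "EMPLOYER_STATE": "CA"}],
        [{"EMPLOYER_NAME": "Initech", "CASE_STATUS": "CERTIFIED", "EXTRA": "x"}],
    ],
}

combined = combine(files_by_year)
for row in combined:
    print(row)
```

After this step every row has the same set of keys, regardless of which file it came from.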

Additionally, since our data comes from forms filled out manually by employers, there were occasional typos and inconsistencies in the data. For example, the company Microsoft might appear with two different names, “Microsoft Corporation” and “Microsoft Corporation.”, the only difference being the “.” at the end. Fortunately, most inconsistencies were easy to handle with simple cleaning techniques.
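A minimal sketch of this kind of cleaning is shown below: it trims whitespace, unifies letter case, collapses repeated spaces, and removes trailing punctuation. The function name and the exact set of rules are assumptions chosen for illustration; they are not necessarily the rules applied in our pipeline.

```python
import re


def normalize_employer_name(name):
    """Return a canonical form of an employer name for grouping."""
    cleaned = name.strip().upper()          # trim and unify case
    cleaned = re.sub(r"\s+", " ", cleaned)  # collapse repeated whitespace
    cleaned = cleaned.rstrip(" .,;")        # drop trailing punctuation
    return cleaned


variants = ["Microsoft Corporation", "Microsoft Corporation.", "  microsoft   corporation "]
normalized = {normalize_employer_name(v) for v in variants}
print(normalized)
```

All three variants collapse to a single name, so they are counted as the same employer.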

In some other cases, the anomalies were so rare that they were not worth cleaning and could be ignored given the amount of data we had. One example is the column employer state, which in some rare circumstances contained non-existent state codes.
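To judge whether such an anomaly is rare enough to ignore, one can measure its share directly. The sketch below checks each value against the set of two-letter postal codes for the 50 states, the District of Columbia, and five US territories. Including the territories and the sample values are assumptions made for the example.

```python
# Two-letter postal codes: 50 states, DC, and five territories.
VALID_STATE_CODES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY DC PR GU VI AS MP".split()
)


def invalid_state_share(values):
    """Return the share of non-empty values that are not valid state codes."""
    present = [v.strip().upper() for v in values if v and v.strip()]
    if not present:
        return 0.0
    invalid = [v for v in present if v not in VALID_STATE_CODES]
    return len(invalid) / len(present)


# Hypothetical values from the employer state column.
states = ["NY", "ca", "WA", "XX", "", "TX"]
share = invalid_state_share(states)
print(f"{share:.0%} of non-empty state codes are invalid")
```

If the resulting share is negligible, the affected rows can simply be left as they are or excluded from state-level summaries.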

2.2 Clarification on LCA status

The dataset contains a column called “case status” which can take values like “certified” or “denied.” It is tempting to think that this status defines the final approval or denial of the H-1B, but it does not. A certified LCA is only a prerequisite to obtaining an H-1B approval. It means that the Department of Labor (DOL) approved the LCA, after which a petition is submitted to the United States Citizenship and Immigration Services (USCIS), which makes the final call.

2.3 Auxiliary datasets

Although the LCAs are our principal data source, we used additional data sources to understand the data better, because most of the categorical variables are coded. One example is the NAICS code, a four- to six-digit number that represents the employer’s industry. To understand what these codes mean, we downloaded an official dictionary of the codes from the Census website.
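The decoding step can be sketched as a simple lookup. Because NAICS codes are hierarchical (a longer code refines the industry named by its shorter prefix), the example falls back to the nearest shorter prefix when the full code is not in the dictionary. The two entries below are a tiny excerpt used for illustration, and the function name and fallback rule are assumptions rather than a description of our exact procedure.

```python
# Tiny illustrative excerpt of a NAICS code dictionary.
NAICS_TITLES = {
    "5415": "Computer Systems Design and Related Services",
    "541511": "Custom Computer Programming Services",
}


def naics_title(code, titles=NAICS_TITLES):
    """Look up a NAICS code, falling back to shorter prefixes if needed."""
    digits = str(code).strip()
    while digits:
        if digits in titles:
            return titles[digits]
        digits = digits[:-1]  # try the parent level of the hierarchy
    return None  # no matching code or prefix


print(naics_title(541511))
print(naics_title("541512"))
print(naics_title("999999"))
```

The second call has no exact match in the excerpt, so it resolves to the title of its four-digit parent, and the third returns None because no prefix is known.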

In general, here are the other datasets used and their purpose:

Data sources: