Website: [login to view URL]
Data Schema
1. The link contains JSON data for various data sources. You will need to scan
through it and filter out any COVID-related data
2. Design a schema for every state and store the data in the respective
per-state tables
3. Apply various indexing techniques in PostgreSQL to enable fast searching
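A minimal sketch of steps 1-3, using only the standard library. The catalog shape (a list of entries with "title" and "url" fields) and the table/column names are assumptions for illustration; the real catalog at the URL above may differ. The DDL pairs a B-tree index on the date column with a GIN index on the raw JSONB payload, two common PostgreSQL indexing techniques for this kind of search.

```python
import json

# Hypothetical catalog shape: a list of dataset entries with a "title" field.
catalog = json.loads("""
[
  {"title": "COVID-19 Case Surveillance", "url": "https://example.com/covid_cases.json"},
  {"title": "Traffic Volumes 2020", "url": "https://example.com/traffic.json"}
]
""")

# Step 1: filter COVID-related datasets with a simple keyword match on the title.
covid_sets = [d for d in catalog if "covid" in d["title"].lower()]

def per_state_ddl(state: str) -> str:
    """Steps 2-3: build DDL for one state's table, with a B-tree index on
    report_date and a GIN index on the JSONB payload for containment queries."""
    table = f"covid_{state.lower()}"
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id SERIAL PRIMARY KEY, report_date DATE, payload JSONB);\n"
        f"CREATE INDEX IF NOT EXISTS idx_{table}_date ON {table} (report_date);\n"
        f"CREATE INDEX IF NOT EXISTS idx_{table}_payload ON {table} USING GIN (payload);"
    )
```

The generated statements would be executed against PostgreSQL by whatever DB driver the solution uses; only the SQL text is built here.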
1. DAG performance and efficiency
2. Concepts of distributed computing
3. Your choice of schema design: 1NF, 2NF, or 3NF
4. Use of asynchronous processing when performing loads
5. How you would scale the DAG as data volume increases
6. Logging and monitoring in case of failures
7. Object-oriented design
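Criterion 4 (asynchronous loads) can be sketched with asyncio from the standard library. Here load_state is a hypothetical stand-in for a real database load routine; the point is that gather() runs all per-state loads concurrently instead of sequentially.

```python
import asyncio

async def load_state(state: str) -> str:
    # Stand-in for a real async Postgres load; a real task would await a DB driver.
    await asyncio.sleep(0)  # yield control back to the event loop
    return f"{state}: loaded"

async def load_all(states):
    # gather() schedules every state's load concurrently on the event loop
    return await asyncio.gather(*(load_state(s) for s in states))

results = asyncio.run(load_all(["NY", "CA", "TX"]))
```

With a real async driver, the awaits overlap network and disk I/O, so total load time approaches that of the slowest state rather than the sum of all states.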
1. Use Python 3 to develop the solution
2. Use Apache Airflow to design a DAG scheduled to run daily
3. Create a task within the DAG to iterate through the JSON and download the
files locally
4. Create a task to load the files into the PostgreSQL schema
5. Optimize your DAG's performance by achieving maximum parallelism locally.
You could tune parallelism, DAG concurrency, thread pools, or max_threads
6. Follow the ETL process of Extract, Transform, and Load
7. Each DAG task should be independent and able to run individually
8. Implement unit or integration tests
9. Containerize your application in a Docker container; use docker-compose
if required
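Task 5 names several Airflow 1.x configuration knobs. A hedged airflow.cfg excerpt showing where they live (the values below are illustrative, not recommendations):

```ini
; airflow.cfg excerpt (Airflow 1.x key names, as referenced in the brief)
[core]
parallelism = 32        ; max task instances running across the installation
dag_concurrency = 16    ; max task instances allowed to run per DAG
[scheduler]
max_threads = 4         ; scheduler worker threads
```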
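A rough sketch of tasks 3, 4, and 7: the extract and load steps as plain, independently runnable functions, with the Airflow wiring shown only as a comment so the sketch stays self-contained. The catalog field names ("name", "url") are assumptions, and the download and database load are stubbed so the example runs offline.

```python
import json
import os

# Each function is a self-contained task (requirement 7): it takes plain
# arguments and returns plain values, so it can be run and tested on its own.

def extract(catalog_path: str, out_dir: str) -> list:
    """Task 3: iterate through the catalog JSON and write each file locally."""
    with open(catalog_path) as f:
        catalog = json.load(f)
    paths = []
    for entry in catalog:
        dest = os.path.join(out_dir, entry["name"] + ".json")
        # A real task would download entry["url"] here; stubbed for the sketch.
        with open(dest, "w") as f:
            json.dump(entry, f)
        paths.append(dest)
    return paths

def load(paths: list) -> int:
    """Task 4: load the downloaded files into PostgreSQL (stubbed as a count)."""
    return len(paths)

# In Airflow (not imported here), the wiring would look roughly like:
#
#   with DAG("covid_etl", schedule_interval="@daily") as dag:
#       t_extract = PythonOperator(task_id="extract", python_callable=extract, ...)
#       t_load = PythonOperator(task_id="load", python_callable=load, ...)
#       t_extract >> t_load
```

Because the tasks are plain functions, requirement 8 (unit tests) reduces to calling them with a temporary directory and asserting on the returned paths and counts.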
I am excellent at Python and comfortable in other languages as well. I have worked on many projects involving web scraping, Python automation, and data analytics. This is a straightforward task for me. Please discuss the project further over chat.
I have experience deploying the Airflow platform (Kubernetes executor + PostgreSQL as the main repository) to manage analytical processing, mainly with pandas. I have read your project description carefully; Docker containerization could be used as well, instead of the dedicated operators as proposed.