
Data science python - implement feature selection from large CSV

$50-140 USD

In progress
Posted over 4 years ago

Paid on delivery
Implement feature selection for a dataset with very high dimensionality. You need to implement one function in Python. The inputs to the function are:
1. string: path to a CSV file
2. integer: n_dimensions, the desired number of output dimensions

The function needs to select the best n_dimensions by **iteratively** fitting a RandomForest classifier and selecting features based on feature_importances_. See: [login to view URL]

IMPORTANT:
1. The CSV is very large: it might be >5 GB, with >1M dimensions. Therefore, your main task in this project is to find the best chunk size for reading the data from the CSV and training the RandomForest classifier, so that the function works in any circumstances, even with very large input files. For example, if the number of output dimensions is 10K and the number of input dimensions is 1M, the optimal way might be to fit 20 RandomForests, each on 50K dimensions, and select 10K from each. Then merge the 200K dimensions selected in the first iteration and fit another 4 RandomForests, each on 50K, again selecting 10K from each. In the last iteration, fit one RandomForest on the remaining 40K and select the final 10K.
2. How many iterations to run and how many RandomForests to fit could be extra parameters to the function. I also expect you to research and recommend the best parameters based on execution time and memory usage. The function needs to run well on the client side, which is normally a Windows 10 PC with 4-8 GB RAM and 2-4 cores. The input should be read into numpy arrays smartly, without loading the entire CSV into memory (numpy may have this functionality built in).
3. The function should support both parallel and serial modes. In parallel mode, the function should use N cores of the PC by fitting several RandomForests in parallel, each on a separate core. In serial mode, the function should fit the RandomForests one by one.
4. The function must run in a Python 3.5 environment on Windows. You must test it on Windows!
5. The function should print its progress to the console.
6. I will send an example CSV to coders with good experience.
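The round structure in item 1 (20 forests, then 4, then 1) can be derived from three numbers. The sketch below is only an illustration of that arithmetic: the function name `plan_rounds` and its `chunk_size` parameter are assumptions, not part of the posting, and it assumes every forest keeps at most `n_keep` of the features it was given. It avoids f-strings and underscore literals so that it runs on Python 3.5.

```python
def plan_rounds(n_features, n_keep, chunk_size):
    """Return [(n_forests, n_features_in), ...], one tuple per selection round.

    Hypothetical helper: each round splits the surviving features into
    blocks of at most chunk_size columns, fits one forest per block, and
    keeps at most n_keep features from each block.
    """
    if not 0 < n_keep < chunk_size:
        raise ValueError("need 0 < n_keep < chunk_size")
    rounds = []
    remaining = n_features
    while remaining > n_keep:
        # full = blocks with exactly chunk_size columns, rest = leftover columns
        full, rest = divmod(remaining, chunk_size)
        n_forests = full + (1 if rest else 0)
        rounds.append((n_forests, remaining))
        # A short last block cannot contribute more features than it holds.
        remaining = full * n_keep + min(rest, n_keep)
    return rounds


if __name__ == "__main__":
    for number, (n_forests, n_in) in enumerate(plan_rounds(1000000, 10000, 50000), 1):
        print("round {}: {} forest(s) over {} features".format(number, n_forests, n_in))
```

With the numbers from the posting (1M inputs, 10K outputs, 50K columns per forest), this reproduces the 20, 4, 1 schedule. A smaller `chunk_size` lowers peak memory per forest but adds rounds and passes over the CSV, which is the trade-off the posting asks bidders to research.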
Project ID: 23258052

About the project

18 proposals
Remote project
Active 4 years ago

18 freelancers are bidding an average of $111 USD for this job
I am a Python data science expert with experience in classification and partitioning, neural networks, and association rules. I am also an Oracle Certified Professional (OCP) with experience in Oracle, MySQL, SQL Server and MongoDB. I can help you with your requirements. Please initiate an online chat so we can proceed with the discussion.
$140 USD in 7 days
4.9 (48 reviews)
6.0
Hello there, I have read through your project description and can help you complete this project. I look forward to hearing from you. Please contact me by PM for details.
$150 USD in 7 days
4.6 (53 reviews)
6.1
Hello, I hold an MSc in mathematics (probability and statistics in particular). I have extensive experience in ACM/ICPC and a deep understanding of algorithms. I attended the ACM regional contest several times and won medals there. I also won medals in the IMO (International Mathematical Olympiad) and have experience in online algorithm contests such as Codechef and Hackerrank. Consultation is also welcome. I have extensive experience with mathematical and algorithmic problems and can help you get insights into the data you described. Skills: Master of Mathematics; algorithms (experience in ACM/ICPC); artificial intelligence (AI); machine learning (neural networks); deep learning; C++/C#/Matlab/R/Python/Java. I will be more than excited to provide you with a quality solution and earn your respect, confidence and trust.
$50 USD in 1 day
4.6 (20 reviews)
5.3
*** Feature Selection using PCA or SVC in Python *** Thank you for your attention. I read the project description with interest. I strongly believe I am the right candidate. I meet all the requirements you mentioned, including Python, and I can work on this full time. Please check my profile and past reviews. Let's move forward and get outstanding results for you. All the best. From HongYue Jin.
$100 USD in 7 days
5.0 (15 reviews)
4.6
I am a machine learning expert. Languages: C++, C#, Python, Qt, Matlab, Java. Skills: machine learning, deep learning, image processing (OpenCV, OpenGL...), video codec processing (H264/265, Mpeg4, YUY2...), databases (MySQL, Access, Excel, MSSQL...), project reversing, multi-threading, system management.
$100 USD in 7 days
5.0 (3 reviews)
4.1
Dear sir, I read your project description very carefully. I have extensive experience in machine learning and Python, so your project is very interesting to me. I'm confident I can handle your project and very eager to join it. If you give me a chance, I'll do my best to deliver an excellent result. I believe this will be a good starting point for a business relationship between us. Looking forward to your response. Thank you.
$140 USD in 7 days
5.0 (3 reviews)
2.9
Hello, I am very happy to bid on your project. I have read your proposal and checked the attached files, and I am very interested in your project. I have several years of experience in machine learning, Python and C#. I am sure I can complete this project on time with good results. I am always available, so please feel free to contact me anytime. Thanks. Best regards.
$100 USD in 3 days
5.0 (1 review)
2.6
Hi, I am a data scientist and Python expert with 5 years of experience. Please message me to discuss.
$120 USD in 5 days
5.0 (1 review)
0.5
Hello, I have read your project requirements and would welcome the opportunity to assist you with this project. I have the expertise to complete it within the agreed time frame, as the project suits my skills. I assure you of quality work and support, now and in the future. Please initiate a chat for further discussion, and I will show you my completed projects and my relevant experience. Please leave a message with your available times if I'm offline, and I will reach you ASAP. Thank you.
$95 USD in 2 days
0.0 (0 reviews)
0.0
Greetings, I am an expert data scientist with more than two years of experience programming in Python. I assure you that I can do your project and deliver a high-quality outcome. I went through your description, and this is something I can manage for you. Hope to hear from you soon. Navid.
$95 USD in 3 days
5.0 (1 review)
0.1
I have worked in feature engineering: for example, how to select the best features for a model, how to handle missing values, how to handle categorical values, etc.
$95 USD in 10 days
0.0 (0 reviews)
0.0
I really enjoy working in data science and artificial intelligence, especially with Microsoft Excel and big data. I have been in this field for 7 years. I also like organizational work and am now active in human resources.
$166 USD in 3 days
0.0 (0 reviews)
0.0
Hi there, I can implement the function in Python. Please share the example CSV. I'm a full-stack developer with 7+ years of experience and extensive knowledge of Python. Here's a sample of backend/DevOps tasks I can help you with:
- Design and implement REST APIs in Flask/Postgres, documented with Swagger, with graphql support (using graphene)
- Dockerise applications and deploy them to AWS ECS, setting up load balancing and auto scaling
- Move infrastructure to code using Terraform
- Integrate apps in AWS EC2, RDS, S3, ElasticCache ...
- Build CI/CD pipelines for testing and automation, recommending best practices like semantic versioning and changelog automation
- Implement monitoring/logging for Datadog and ElasticCache, including custom instrumentation for APM
- Automate existing workflows in Python
Let's connect to discuss the details. Regards, Mishal
$140 USD in 7 days
0.0 (0 reviews)
0.0
Dear sir, I am a Python expert with a lot of experience processing data in Python, and I am also very familiar with ML. I've carefully read your description and I am sure I can help you. I hope we can talk. Thank you.
$50 USD in 7 days
0.0 (0 reviews)
0.0
Hello. I'd like to draw attention to a few questions that I find crucial here.
1. You use the phrase "the best n_dimensions", which might be interpreted in many ways. What metric defines "best" in this task?
2. How do we measure goodness of fit here? I see no other subsets in this task, so my guess is that model performance should be evaluated by local validation. How many local folds should be used to evaluate performance? (Depending on the number of folds, you might get different results.)
3. Parallelization and progress output. I believe you assume the standard sklearn RandomForest as the backbone here. It doesn't support parallelization the way you want: it supports parallelization only while building a single RandomForest (i.e. you build the first RF using all threads, then the second, etc.).
4. Dimensions and tech requirements. That's very difficult to deal with, since the approach on 4 GB RAM/Win10 with 1M dimensions and no provided dtypes can vary a lot. It would be nice to know the dtypes of the dimensions, whether the data is sparse or not, etc.
5. It is not very clear what you want in this task. Do you want a function that removes useless (or barely useful) features from the dataset and gives you only the top-k features? Here again the goodness-of-fit question (from point 2) arises: different subsets might give different results.
So my advice is to work closely with whoever you choose on these questions and reach a consensus for a better result. Kind regards.
$140 USD in 7 days
0.0 (0 reviews)
0.0
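On the parallelization question raised in the last proposal: scikit-learn's RandomForestClassifier parallelizes over the trees of a single forest through its n_jobs parameter. One possible way to get the "one forest per core" behaviour the posting asks for is to set n_jobs=1 on each forest and dispatch whole forests to worker processes with joblib. The following is a minimal sketch under those assumptions, not a recommended final design: `top_features`, `select_round`, the 50-tree forests and the in-memory array `X` are illustrative choices that do not come from the posting, and reading column blocks from the CSV is left out.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestClassifier


def top_features(X, y, columns, n_keep, seed=0):
    """Fit one single-threaded forest on X[:, columns]; return the n_keep best columns."""
    # n_jobs=1 keeps each forest on one core; n_estimators=50 is an arbitrary choice.
    forest = RandomForestClassifier(n_estimators=50, n_jobs=1, random_state=seed)
    forest.fit(X[:, columns], y)
    best = np.argsort(forest.feature_importances_)[::-1][:n_keep]
    return np.sort(columns[best])


def select_round(X, y, columns, chunk_size, n_keep, n_cores=1):
    """Run one selection round over blocks of columns, serially or in parallel."""
    columns = np.asarray(columns)
    chunks = [columns[i:i + chunk_size] for i in range(0, len(columns), chunk_size)]
    if n_cores == 1:
        # Serial mode: fit the forests one by one and report progress.
        results = []
        for number, chunk in enumerate(chunks, 1):
            results.append(top_features(X, y, chunk, n_keep))
            print("forest {}/{} done".format(number, len(chunks)))
    else:
        # Parallel mode: one forest per worker; verbose makes joblib log progress.
        results = Parallel(n_jobs=n_cores, verbose=10)(
            delayed(top_features)(X, y, chunk, n_keep) for chunk in chunks
        )
    return np.concatenate(results)


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = rng.rand(200, 12)
    y = (X[:, 3] + X[:, 7] > 1.0).astype(int)  # only columns 3 and 7 carry signal
    print(select_round(X, y, np.arange(12), chunk_size=6, n_keep=2, n_cores=2))
```

Because every forest uses the same fixed random_state, serial and parallel mode should select the same columns. On Windows, worker processes are spawned rather than forked, so keeping the call under the `if __name__ == "__main__":` guard is the usual precaution.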

About this client

Bergisch Gladbach, Germany
5.0
271
Payment method verified
Member since Aug 27, 2004

Freelancer ® is a registered Trademark of Freelancer Technology Pty Limited (ACN 142 189 759)
Copyright © 2024 Freelancer Technology Pty Limited (ACN 142 189 759)